Abstract—Automatic identity recognition in a fast, reliable, and non-intrusive way is one of the most challenging topics in today's digital world. A possible approach to identity recognition is identification by voice. The characteristics of speech relevant for automatic speaker recognition can be affected by external factors such as noise and channel distortions, but also by speaker-specific conditions such as emotional or health states. This paper addresses the improvement of a speaker recognition system through different model training strategies, with the goal of obtaining the best performance from only a limited amount of neutral and emotional speech data. The models adopted are a Gaussian Mixture Model (GMM) and i-vectors, whose inputs are Mel Frequency Cepstral Coefficients (MFCCs), and the experiments were conducted on the Russian Language Affective speech database (RUSLANA). The results show that the appropriate use of emotional speech in speaker model training improves the robustness of a speaker recognition system, both when tested on neutral and on emotional speech.
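The GMM branch of the approach described above can be illustrated with a minimal sketch: one Gaussian mixture is trained per enrolled speaker on that speaker's per-frame feature vectors, and a test utterance is attributed to the speaker whose model yields the highest average log-likelihood. This is a generic GMM speaker-identification sketch, not the paper's exact configuration; the synthetic 13-dimensional features stand in for real MFCCs, and the component count and speaker names are illustrative assumptions.

```python
# Hedged sketch of GMM-based speaker identification on MFCC-like
# features. Real systems extract MFCCs from audio; here we use
# synthetic, well-separated feature clouds purely for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Simulated per-frame feature vectors (frames x 13 coefficients)
# for two hypothetical enrolled speakers.
train = {
    "spk_a": rng.normal(loc=0.0, scale=1.0, size=(500, 13)),
    "spk_b": rng.normal(loc=3.0, scale=1.0, size=(500, 13)),
}

# One GMM per speaker; 4 diagonal-covariance components is an
# illustrative choice, not the paper's setting.
models = {
    spk: GaussianMixture(n_components=4, covariance_type="diag",
                         random_state=0).fit(feats)
    for spk, feats in train.items()
}

def identify(frames):
    """Score an utterance (frames x dims) against every speaker
    model; return the speaker with the highest average
    log-likelihood per frame."""
    return max(models, key=lambda spk: models[spk].score(frames))

# An utterance drawn from speaker B's feature distribution.
test_utt = rng.normal(loc=3.0, scale=1.0, size=(200, 13))
print(identify(test_utt))
```

Training strategies such as those studied in the paper would vary which data (neutral, emotional, or both) populates each speaker's training set before fitting the per-speaker models.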
ACKNOWLEDGMENTS
The authors would like to thank V. A. Petrushin for access to the RUSLANA database and V. Dellwo for access to the TVOID database.
Milošević, M., Nedeljković, Ž., Glavitsch, U. et al. Speaker Modeling Using Emotional Speech for More Robust Speaker Identification. J. Commun. Technol. Electron. 64, 1256–1265 (2019). https://doi.org/10.1134/S1064226919110184