Utterance partitioning for speaker recognition: an experimental review and analysis with new findings under GMM-SVM framework

International Journal of Speech Technology

Abstract

The performance of a speaker recognition system depends strongly on the duration of the speech used for enrollment and testing. This work presents a detailed experimental review and analysis of the GMM-SVM based speaker recognition system in the presence of duration variability. It also compares the performance of the GMM-SVM classifier with that of its precursor, the Gaussian mixture model-universal background model (GMM-UBM) classifier, under the same conditions. The goal of this work is not to propose a new algorithm for improving speaker recognition performance in the presence of duration variability. Instead, it focuses on utterance partitioning (UP), a commonly used strategy for compensating for duration variability. We analyse in detail the impact of partitioning the training utterances on speaker recognition performance under the GMM-SVM framework. We further investigate why utterance partitioning is important for boosting speaker recognition performance, and we show in which cases it is useful and in which it is not. Our study reveals that utterance partitioning does not reduce the data imbalance problem of the GMM-SVM classifier, contrary to what was claimed in an earlier study. In addition, we discuss, from a speech duration perspective, the impact of parameters such as the number of Gaussians, the supervector length, and the amount of splitting required to obtain better performance in short- and long-duration test conditions. The experiments were performed on telephone speech from the POLYCOST corpus, consisting of 130 speakers.
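To make the idea of utterance partitioning concrete, the following is a minimal sketch and not the authors' implementation. It assumes an enrollment utterance is already available as a sequence of feature frames, and it splits that sequence into contiguous, near-equal sub-utterances while also keeping the full utterance. The function name `partition_utterance`, the parameter `num_parts`, and the choice of contiguous splitting are illustrative assumptions; in a GMM-SVM system each returned segment would then be turned into its own supervector, a step that is omitted here.

```python
from typing import List, Sequence


def partition_utterance(frames: Sequence, num_parts: int) -> List[list]:
    """Split a sequence of feature frames into contiguous sub-utterances.

    Returns the full utterance followed by `num_parts` near-equal segments,
    so one enrollment utterance yields `num_parts + 1` training inputs.
    (Keeping the full utterance is an assumption made for this sketch.)
    """
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    if num_parts > len(frames):
        raise ValueError("more parts requested than frames available")

    base, extra = divmod(len(frames), num_parts)
    segments = [list(frames)]  # keep the unpartitioned utterance as well
    start = 0
    for i in range(num_parts):
        # The first `extra` segments take one additional frame each.
        end = start + base + (1 if i < extra else 0)
        segments.append(list(frames[start:end]))
        start = end
    return segments


# Hypothetical example: 10 frames, represented here by their indices.
frames = list(range(10))
parts = partition_utterance(frames, num_parts=4)
for p in parts:
    print(len(p), p)
```

In this sketch a larger `num_parts` produces more, but shorter, training segments per speaker, which is the trade-off between the amount of splitting and segment duration that the abstract refers to.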

Acknowledgements

The authors are grateful to Professor Goutam Saha, Department of E & ECE, IIT Kharagpur, for his help with the experiments on the POLYCOST database. The first author is extremely grateful to Dr. Richa Mittal, erstwhile student of the Department of CET, IIT Kharagpur, for her help during the preparation of the manuscript. The first author is also extremely grateful to Dr. Rahul Dasgupta, erstwhile student of the Department of CET, IIT Kharagpur, for rigorous technical discussions.

Author information

Corresponding author

Correspondence to Md Sahidullah.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sen, N., Sahidullah, M., Patil, H.A. et al. Utterance partitioning for speaker recognition: an experimental review and analysis with new findings under GMM-SVM framework. Int J Speech Technol 24, 1067–1088 (2021). https://doi.org/10.1007/s10772-021-09862-8

  • DOI: https://doi.org/10.1007/s10772-021-09862-8
