Applying EEND Diarization to Telephone Recordings from a Call Center

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12997)

Abstract

In this paper, we focus on the issue of speaker diarization of data from a real call center. We have previously proposed a specialized solution to the problem, which employed additional knowledge about the identities of the phone operators (in our case, the language counselors from the Language Consulting Center), thus improving performance over the baseline. But a recent end-to-end diarization method, EEND, has since proven very successful on other data and was shown to surpass the previous state of the art in the field. Thus, we chose to compare this new method with our own previous approach. Using an existing implementation of the EEND method (adapted using a small amount of the target data from the Language Consulting Center), we successfully surpass the performance of our previous approach (17.42% vs. 19.39% DER), without the need for any additional information about speaker identities. The majority of the remaining diarization error of the EEND system is due to incorrect decisions between speech and silence, rather than speaker confusion. For comparison, we also show the results of a more standard diarization approach, represented by the method used in the Kaldi toolkit.
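
The DER figures above combine three error types: missed speech, false-alarm speech, and speaker confusion. The following is a minimal frame-level sketch of that decomposition, offered only as an illustration. It is not necessarily the scoring tool used in the paper, it applies no forgiveness collar, and the function name `der_components` and the toy labels are assumptions made for this example.

```python
from itertools import permutations


def der_components(ref, hyp):
    """Frame-level DER breakdown (illustrative sketch).

    ref, hyp: equal-length lists holding one set of active speaker labels
    per frame; an empty set means silence. Labels must not be None.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must cover the same frames")

    total = sum(len(r) for r in ref)  # total reference speaker time
    if total == 0:
        raise ValueError("reference contains no speech")

    pairs = list(zip(ref, hyp))
    # Speech/silence errors do not depend on the speaker mapping
    miss = sum(max(0, len(r) - len(h)) for r, h in pairs)
    false_alarm = sum(max(0, len(h) - len(r)) for r, h in pairs)
    # Speaker slots present in both reference and hypothesis
    overlap = sum(min(len(r), len(h)) for r, h in pairs)

    ref_spk = sorted(set().union(*ref))
    hyp_spk = sorted(set().union(*hyp))
    # Pad with None so surplus hypothesis speakers can stay unmatched
    targets = ref_spk + [None] * max(0, len(hyp_spk) - len(ref_spk))

    # Brute-force search for the one-to-one mapping with the most correct
    # frames; acceptable for the two or three speakers of a phone call
    best_correct = 0
    for perm in permutations(targets, len(hyp_spk)):
        mapping = dict(zip(hyp_spk, perm))
        correct = sum(len(r & {mapping[s] for s in h}) for r, h in pairs)
        best_correct = max(best_correct, correct)

    confusion = overlap - best_correct
    return {
        "miss": miss,
        "false_alarm": false_alarm,
        "confusion": confusion,
        "der": (miss + false_alarm + confusion) / total,
    }


# Toy example: 10 frames, reference speakers A and B, hypothesis x and y
ref = [{"A"}] * 4 + [set()] * 2 + [{"B"}] * 4
hyp = [{"x"}] * 3 + [set()] * 2 + [{"y"}] * 4 + [{"x"}]
print(der_components(ref, hyp))
```

In the toy example, one frame is missed, one is a false alarm, and one is attributed to the wrong speaker, giving a DER of 3/8. A system whose remaining error is mostly speech versus silence decisions, as the abstract reports for EEND, would show large `miss` and `false_alarm` counts relative to `confusion`.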

Notes

  1. https://github.com/hitachi-speech/EEND

  2. https://starfos.tacr.cz/en/project/DG16P02B009

  3. https://github.com/kaldi-asr/kaldi/tree/master/egs/callhome_diarization/v2

  4. https://github.com/hitachi-speech/EEND

Acknowledgments

The work described herein has been supported by the Ministry of Education, Youth and Sports of the Czech Republic, Project No. LM2018101 LINDAT/CLARIAH-CZ. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the program “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.

Author information

Corresponding author

Correspondence to Zbyněk Zajíc.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Zajíc, Z., Kunešová, M., Müller, L. (2021). Applying EEND Diarization to Telephone Recordings from a Call Center. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_72

  • DOI: https://doi.org/10.1007/978-3-030-87802-3_72

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87801-6

  • Online ISBN: 978-3-030-87802-3

  • eBook Packages: Computer Science, Computer Science (R0)
