Directional Clustering with Polyharmonic Phase Estimation for Enhanced Speaker Localization

Astapov, Sergei; Popov, Dmitriy; Kabarov, Vladimir

doi:10.1007/978-3-030-60276-5_5

Conference paper
First Online: 29 September 2020

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12335)

Abstract

Recently developed approaches to distant speech processing still fall short of close-talking speech processing in speech recognition, speaker identification, and diarization quality. Sound source localization remains an important component of multi-channel distant speech processing applications. This paper considers an approach to improving speaker localization quality on large-aperture microphone arrays. To mitigate the shortcomings of signal acquisition with large-aperture arrays and to reduce the impact of noise and interference, a time-frequency masking approach is proposed that applies Complex Angular Central Gaussian Mixture Models for directional clustering of sound sources and inter-component phase analysis for restoring polyharmonic speech components. The approach is tested on real-life multi-speaker recordings and is shown to increase speaker localization accuracy for non-overlapped and partially overlapped speech.
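
As a rough illustration of how a time-frequency mask can feed into localization, the sketch below weights a cross-correlation statistic for one microphone pair with a mask before picking the delay. It is not the method of the paper: the mask is assumed to be given (in the paper it would come from directional clustering and phase analysis, which are not reproduced here), GCC-PHAT on a single pair is used as a simple stand-in localization statistic, and the names `stft`, `masked_gcc_phat_delay`, `n_fft`, `hop`, and `max_lag` are hypothetical.

```python
import numpy as np


def stft(x, n_fft=512, hop=256):
    """Minimal Hann-windowed STFT; returns an array of shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def masked_gcc_phat_delay(x1, x2, mask=None, n_fft=512, hop=256, max_lag=32):
    """Estimate the delay of x2 relative to x1, in samples.

    `mask` is an assumed, externally supplied time-frequency mask with the
    same shape as the STFT (frames, bins); values near 1 mark bins believed
    to be dominated by the target speaker.
    """
    X1, X2 = stft(x1, n_fft, hop), stft(x2, n_fft, hop)
    if mask is None:
        mask = np.ones(X1.shape)
    # PHAT weighting: keep only the phase of the cross-spectrum.
    cross = X2 * np.conj(X1)
    cross = cross / (np.abs(cross) + 1e-12)
    # Pool over frames, letting the mask decide how much each bin counts.
    pooled = (mask * cross).sum(axis=0)
    cc = np.fft.irfft(pooled, n=n_fft)
    # Negative lags wrap to the end of the circular correlation.
    lags = np.arange(-max_lag, max_lag + 1)
    return int(lags[np.argmax(cc[lags])])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.standard_normal(8000)
    delayed = np.roll(source, 5)  # second channel lags by 5 samples
    print(masked_gcc_phat_delay(source, delayed))
```

With an all-ones mask this reduces to ordinary GCC-PHAT pooled over frames; a mask that suppresses noisy or interfering bins changes only which bins contribute to the correlation peak.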

Acknowledgments

This research was financially supported by the Foundation NTI (Contract 20/18gr, ID 0000000007418QR20002) and by the Government of the Russian Federation (Grant 08-08).

Author information

Authors and Affiliations

International Research Laboratory “Multimodal Biometric and Speech Systems,” ITMO University, Kronverksky prospekt 49A, St. Petersburg, 197101, Russia
Sergei Astapov & Vladimir Kabarov
Speech Technology Center, Vyborgskaya naberezhnaya 45E, St. Petersburg, 194044, Russia
Dmitriy Popov

Corresponding author

Correspondence to Sergei Astapov.

Editor information

Editors and Affiliations

St. Petersburg Institute for Informatics and Automation, Russian Academy of Sciences, St. Petersburg, Russia
Alexey Karpov
Institute for Applied and Mathematical Linguistics, Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova

About this paper

Cite this paper

Astapov, S., Popov, D., Kabarov, V. (2020). Directional Clustering with Polyharmonic Phase Estimation for Enhanced Speaker Localization. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_5

DOI: https://doi.org/10.1007/978-3-030-60276-5_5
Published: 29 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60275-8
Online ISBN: 978-3-030-60276-5
eBook Packages: Computer Science, Computer Science (R0)