
Everyday Conversations: A Comparative Study of Expert Transcriptions and ASR Outputs at a Lexical Level

  • Conference paper
  • Speech and Computer (SPECOM 2023)

Abstract

The study examines the output of automatic speech recognition (ASR) applied to field recordings of everyday Russian speech. Everyday conversations, captured in real-life communicative settings, are a particularly difficult subject for ASR for several reasons: they may contain speech from many speakers, the loudness of the interlocutors’ speech fluctuates, a substantial share of the speech overlaps across two or more speakers, and significant noise interference occurs intermittently. The presented research compares transcripts of these recordings produced by two recognition systems, the NTR Acoustic Model and OpenAI’s Whisper, against expert transcriptions of the same recordings. A comparison of the three frequency lists (expert transcription, acoustic model, and Whisper) shows that each model has its own characteristics at the lexical level. At the same time, both models perform worse on the groups of words typical of spontaneous, unprepared dialogue: discursive words, pragmatic markers, backchannel responses, interjections, reduced conversational word forms, and hesitations. These findings aim to foster improvements in ASR systems designed to transcribe conversational speech, such as work meetings and everyday dialogues.
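
The lexical comparison at the heart of the study rests on building a word-frequency list for each transcript and contrasting the lists. The sketch below shows one way such a comparison could be set up in Python; the tokenization rule, file names, and the frequency-drop heuristic are illustrative assumptions, not the authors’ actual procedure.

```python
# Minimal sketch of a lexical-level transcript comparison, in the spirit of the
# frequency-list analysis described in the abstract. File names, tokenization,
# and the ranking heuristic are assumptions for illustration only.
import re
from collections import Counter

def frequency_list(path: str) -> Counter:
    """Build a case-folded word-frequency list from a plain-text transcript."""
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    # Keep Cyrillic and Latin word forms, including hyphenated reduced forms.
    tokens = re.findall(r"[a-zа-яё]+(?:-[a-zа-яё]+)*", text)
    return Counter(tokens)

def relative_frequencies(counts: Counter) -> dict:
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def underrecognized(expert: Counter, asr: Counter, top_n: int = 20) -> list:
    """Words whose relative frequency drops most from expert to ASR output."""
    e, a = relative_frequencies(expert), relative_frequencies(asr)
    deltas = {w: e[w] - a.get(w, 0.0) for w in e}
    return sorted(deltas, key=deltas.get, reverse=True)[:top_n]

if __name__ == "__main__":
    expert = frequency_list("expert_transcription.txt")  # hypothetical file
    whisper = frequency_list("whisper_output.txt")       # hypothetical file
    print(underrecognized(expert, whisper))
```

In such a setup, discursive words, backchannel responses, and hesitations would be expected to surface near the top of the ranking, since they occur frequently in the expert transcription but are often dropped or normalized away by the recognizers.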


Notes

  1. Hereinafter, the code of the macro-episode of the ORD corpus.


Acknowledgements

This article is an output of a research project “Text as Big Data: Modeling Convergent Processes in Language and Speech using Digital Methods” implemented as part of the Basic Research Program at the National Research University Higher School of Economics (HSE University).

Author information


Corresponding author

Correspondence to Tatiana Sherstinova.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Sherstinova, T., Kolobov, R., Mikhaylovskiy, N. (2023). Everyday Conversations: A Comparative Study of Expert Transcriptions and ASR Outputs at a Lexical Level. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds.) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science, vol. 14338. Springer, Cham. https://doi.org/10.1007/978-3-031-48309-7_4


  • DOI: https://doi.org/10.1007/978-3-031-48309-7_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-48308-0

  • Online ISBN: 978-3-031-48309-7

  • eBook Packages: Computer Science, Computer Science (R0)
