
Everyday Conversations: A Comparative Study of Expert Transcriptions and ASR Outputs at a Lexical Level

  • Conference paper
  • Speech and Computer (SPECOM 2023)

Abstract

The study examines the output of automatic speech recognition (ASR) applied to field recordings of everyday Russian speech. Everyday conversations, captured in real-life communicative settings, are a particularly difficult subject for ASR for several reasons: they may contain speech from many speakers, the loudness of the interlocutors’ speech fluctuates, a substantial share of the speech overlaps across two or more speakers, and significant noise interference occurs intermittently. The presented research compares transcripts of these recordings produced by two recognition systems, the NTR Acoustic Model and OpenAI’s Whisper, against expert transcriptions of the same recordings. A comparison of the three frequency lists (expert transcription, acoustic model, and Whisper) shows that each model has its own characteristics at the lexical level. At the same time, both models perform worse on the groups of words typical of spontaneous, unprepared dialogue: discursive words, pragmatic markers, backchannel responses, interjections, reduced conversational word forms, and hesitations. These findings aim to foster improvements in ASR systems designed to transcribe conversational speech, such as work meetings and everyday dialogues.
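
The lexical comparison at the heart of the study rests on building a word-frequency list for each transcript and contrasting the lists. The sketch below shows one way such a comparison could be set up in Python; the tokenization rule, file names, and the frequency-drop heuristic are illustrative assumptions, not the authors’ actual procedure.

```python
# Minimal sketch of a lexical-level transcript comparison, in the spirit of the
# frequency-list analysis described in the abstract. File names, tokenization,
# and the ranking heuristic are assumptions for illustration only.
import re
from collections import Counter

def frequency_list(path: str) -> Counter:
    """Build a case-folded word-frequency list from a plain-text transcript."""
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    # Keep Cyrillic and Latin word forms, including hyphenated reduced forms.
    tokens = re.findall(r"[a-zа-яё]+(?:-[a-zа-яё]+)*", text)
    return Counter(tokens)

def relative_frequencies(counts: Counter) -> dict:
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def underrecognized(expert: Counter, asr: Counter, top_n: int = 20) -> list:
    """Words whose relative frequency drops most from expert to ASR output."""
    e, a = relative_frequencies(expert), relative_frequencies(asr)
    deltas = {w: e[w] - a.get(w, 0.0) for w in e}
    return sorted(deltas, key=deltas.get, reverse=True)[:top_n]

if __name__ == "__main__":
    expert = frequency_list("expert_transcription.txt")  # hypothetical file
    whisper = frequency_list("whisper_output.txt")       # hypothetical file
    print(underrecognized(expert, whisper))
```

In such a setup, discursive words, backchannel responses, and hesitations would be expected to surface near the top of the ranking, since they occur frequently in the expert transcription but are often dropped or normalized away by the recognizers.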


Notes

  1. Hereinafter, the code of the macro-episode of the ORD corpus.


Acknowledgements

This article is an output of a research project “Text as Big Data: Modeling Convergent Processes in Language and Speech using Digital Methods” implemented as part of the Basic Research Program at the National Research University Higher School of Economics (HSE University).

Author information


Corresponding author

Correspondence to Tatiana Sherstinova.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Sherstinova, T., Kolobov, R., Mikhaylovskiy, N. (2023). Everyday Conversations: A Comparative Study of Expert Transcriptions and ASR Outputs at a Lexical Level. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds.) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science, vol. 14338. Springer, Cham. https://doi.org/10.1007/978-3-031-48309-7_4


  • DOI: https://doi.org/10.1007/978-3-031-48309-7_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-48308-0

  • Online ISBN: 978-3-031-48309-7

  • eBook Packages: Computer Science, Computer Science (R0)
