RuMedSpellchecker: Correcting Spelling Errors for Natural Russian Language in Electronic Health Records Using Machine Learning Techniques

Pogrebnoi, Dmitrii; Funkner, Anastasia; Kovalchuk, Sergey

doi:10.1007/978-3-031-36024-4_16

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 10475)

Included in the following conference series:

International Conference on Computational Science

Abstract

Advances in machine learning have produced a variety of predictive and decision-making medical models that greatly improve the efficacy of treatment and the quality of care. In healthcare, such models are often based on electronic health records (EHRs). The quality of these models depends on the quality of the EHRs, which are usually presented as plain unstructured text. Such records often contain spelling errors, which reduce the quality of intelligent systems based on them. In this paper we present a method and tool for correcting spelling errors in medical texts in Russian. By combining the Symmetrical Deletion algorithm with a fine-tuned BERT model, the tool can improve the quality of original medical texts without significant cost. We evaluated the correction precision and performance of the presented tool and compared it with other popular spelling error correction tools that support the Russian language. Experiments show that the presented approach and tool outperform existing open-source tools for automatically correcting spelling errors in Russian medical texts by 7%. The proposed tool and its source code are available on GitHub (https://github.com/DmitryPogrebnoy/MedSpellChecker) and PyPI (https://pypi.org/project/medspellchecker).
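To make the candidate-generation idea concrete, here is a minimal sketch of a symmetric-deletion lookup in plain Python. It is not the authors' implementation: the function names (`deletes`, `build_index`, `candidates`), the tiny three-word dictionary, and the edit distance of 1 are all assumptions chosen for illustration. The sketch also leaves out two steps a real tool would need: verifying each candidate with a true edit-distance computation, and ranking candidates in context (the step the abstract assigns to the BERT model).

```python
from collections import defaultdict
from itertools import combinations


def deletes(word, max_distance=1):
    """Return the word plus every string made by deleting up to
    max_distance characters from it."""
    results = {word}
    for d in range(1, min(max_distance, len(word)) + 1):
        for positions in combinations(range(len(word)), d):
            drop = set(positions)
            results.add("".join(ch for i, ch in enumerate(word) if i not in drop))
    return results


def build_index(dictionary, max_distance=1):
    """Precompute a map from each delete-variant to the dictionary
    terms that produce it."""
    index = defaultdict(set)
    for term in dictionary:
        for variant in deletes(term, max_distance):
            index[variant].add(term)
    return index


def candidates(word, index, max_distance=1):
    """Look up dictionary terms that share a delete-variant with the
    input word. Results are unranked and not yet verified by a real
    edit-distance check."""
    found = set()
    for variant in deletes(word, max_distance):
        found |= index.get(variant, set())
    return found


# Hypothetical toy dictionary of Russian medical terms.
dictionary = ["давление", "диабет", "гипертония"]
index = build_index(dictionary)

# "давлние" is "давление" with one letter missing.
print(candidates("давлние", index))
```

Because deletions are applied to both the dictionary terms and the input word, the lookup covers insertions, deletions, substitutions, and transpositions without enumerating every possible edit over the alphabet, which is what makes this family of algorithms inexpensive at query time.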

Acknowledgments

This work was supported by the Ministry of Science and Higher Education of the Russian Federation, goszadanie (state assignment) no. 2019-1339.

Author information

Authors and Affiliations

ITMO University, Saint Petersburg, Russia
Dmitrii Pogrebnoi, Anastasia Funkner & Sergey Kovalchuk

Corresponding author

Correspondence to Dmitrii Pogrebnoi.

Editor information

Editors and Affiliations

Czech Technical University in Prague, Prague, Czech Republic
Jiří Mikyška
University of Amsterdam, Amsterdam, The Netherlands
Clélia de Mulatier
AGH University of Science and Technology, Krakow, Poland
Maciej Paszynski
University of Amsterdam, Amsterdam, The Netherlands
Valeria V. Krzhizhanovskaya
University of Tennessee at Knoxville, Knoxville, TN, USA
Jack J. Dongarra
University of Amsterdam, Amsterdam, The Netherlands
Peter M.A. Sloot

About this paper

Cite this paper

Pogrebnoi, D., Funkner, A., Kovalchuk, S. (2023). RuMedSpellchecker: Correcting Spelling Errors for Natural Russian Language in Electronic Health Records Using Machine Learning Techniques. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 10475. Springer, Cham. https://doi.org/10.1007/978-3-031-36024-4_16

DOI: https://doi.org/10.1007/978-3-031-36024-4_16
Published: 26 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36023-7
Online ISBN: 978-3-031-36024-4
eBook Packages: Computer Science, Computer Science (R0)
