RuMedSpellchecker: Correcting Spelling Errors for Natural Russian Language in Electronic Health Records Using Machine Learning Techniques

  • Conference paper
Computational Science – ICCS 2023 (ICCS 2023)

Abstract

Advances in machine learning have produced a variety of predictive and decision-making medical models that improve both the efficacy of treatment and the quality of care. In healthcare, such models are often based on electronic health records (EHRs). The quality of these models depends on the quality of the EHRs, which are usually stored as plain unstructured text. Such records often contain spelling errors, which degrade the intelligent systems built on them. In this paper we present a method and a tool for correcting spelling errors in Russian medical texts. By combining the Symmetrical Deletion algorithm with a fine-tuned BERT model, the tool improves the quality of the original medical texts without significant cost. We evaluated the correction precision and performance of the tool and compared it with other popular spelling correction tools that support the Russian language. The experiments show that the presented approach and tool outperform existing open-source tools for automatic spelling correction of Russian medical texts by 7%. The tool and its source code are available on GitHub (https://github.com/DmitryPogrebnoy/MedSpellChecker) and PyPI (https://pypi.org/project/medspellchecker).
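As a rough illustration of the idea named in the abstract, the sketch below shows how a symmetric-delete index can generate correction candidates, which a separate scoring function then ranks. It is not the authors' implementation: the function names (`deletes`, `build_index`, `candidates`, `correct`) and the toy vocabulary are hypothetical, and the `score` argument is only a stand-in for a context-aware model such as a fine-tuned BERT. Splitting the work into candidate generation and ranking is itself an assumption about how the two components fit together.

```python
from collections import defaultdict
from itertools import combinations


def deletes(word, max_distance=1):
    """Return the word plus every string obtained by deleting
    up to max_distance characters from it."""
    variants = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for kept in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in kept))
    return variants


def build_index(vocabulary, max_distance=1):
    """Map every delete-variant of every dictionary term back to that term."""
    index = defaultdict(set)
    for term in vocabulary:
        for variant in deletes(term, max_distance):
            index[variant].add(term)
    return index


def candidates(word, index, max_distance=1):
    """Dictionary terms sharing at least one delete-variant with the word.

    This over-approximates: a real tool would also verify the true edit
    distance of each candidate, which this sketch omits.
    """
    found = set()
    for variant in deletes(word, max_distance):
        found |= index.get(variant, set())
    return found


def correct(word, index, score, max_distance=1):
    """Pick the highest-scoring candidate; keep the word if none is found.

    `score` is a hypothetical placeholder for a language-model score.
    """
    options = candidates(word, index, max_distance)
    return max(sorted(options), key=score) if options else word


# Toy vocabulary (assumed for illustration only).
vocabulary = ["давление", "дыхание", "сердце"]
index = build_index(vocabulary)

print(correct("давленее", index, score=len))  # prints: давление
print(correct("кровь", index, score=len))     # prints: кровь (no candidate found)
```

Only deletions are ever generated, on both the dictionary side and the input side, so the lookup cost stays small compared with enumerating all insertions, substitutions, and transpositions of the misspelled word.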

Notes

  1. https://github.com/pymorphy2/pymorphy2

  2. https://huggingface.co/sberbank-ai/ruRoberta-large

  3. https://huggingface.co/DmitryPogrebnoy/MedRuRobertaLarge

  4. https://huggingface.co/distilbert-base-multilingual-cased

  5. https://huggingface.co/DmitryPogrebnoy/distilbert-base-russian-cased

  6. https://huggingface.co/DmitryPogrebnoy/MedDistilBertBaseRuCased

  7. https://huggingface.co/cointegrated/rubert-tiny2

  8. https://huggingface.co/DmitryPogrebnoy/MedRuBertTiny2

  9. https://pypi.org/project/medspellchecker

Acknowledgments

This work was supported by the Ministry of Science and Higher Education of the Russian Federation, state assignment (goszadanie) no. 2019-1339.

Author information

Corresponding author

Correspondence to Dmitrii Pogrebnoi.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Pogrebnoi, D., Funkner, A., Kovalchuk, S. (2023). RuMedSpellchecker: Correcting Spelling Errors for Natural Russian Language in Electronic Health Records Using Machine Learning Techniques. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 10475. Springer, Cham. https://doi.org/10.1007/978-3-031-36024-4_16

  • DOI: https://doi.org/10.1007/978-3-031-36024-4_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-36023-7

  • Online ISBN: 978-3-031-36024-4

  • eBook Packages: Computer Science, Computer Science (R0)
