Abstract
Advances in machine learning have produced a variety of predictive and decision-making medical models that improve treatment efficacy and quality of care. In healthcare, such models are often built on electronic health records (EHRs). The quality of these models depends on the quality of the EHRs, which are usually stored as plain unstructured text. Such records often contain spelling errors, which degrade the intelligent systems that rely on them. In this paper we present a method and a tool for correcting spelling errors in Russian medical texts. By combining the Symmetric Delete algorithm with a fine-tuned BERT model, the tool improves the quality of the original medical texts without significant cost. We evaluated the correction precision and performance of the tool and compared it with other popular spelling correction tools that support the Russian language. Experiments show that the presented approach and tool outperform existing open-source tools for automatic spelling correction in Russian medical texts by 7%. The tool and its source code are available on GitHub (https://github.com/DmitryPogrebnoy/MedSpellChecker) and PyPI (https://pypi.org/project/medspellchecker).
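The combination described in the abstract can be illustrated with a minimal Python sketch. It is not the tool's actual API: the function names (`deletes`, `build_index`, `candidates`, `correct`) and the `score` callable are hypothetical and introduced only for illustration. Candidate generation follows the general idea of Symmetric Delete, where both dictionary terms and the misspelled word are reduced by character deletions and matched on the shared variants. The `score` argument stands in for a context-aware ranker such as a fine-tuned BERT model; here it is assumed to be any function of the context and a candidate. The sketch also omits the edit-distance check that a full implementation would apply to filter candidates.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set


def deletes(word: str, max_distance: int) -> Set[str]:
    """Return the word plus every string obtained by deleting up to max_distance characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        next_frontier = set()
        for w in frontier:
            for i in range(len(w)):
                next_frontier.add(w[:i] + w[i + 1:])
        results |= next_frontier
        frontier = next_frontier
    return results


def build_index(vocabulary: Iterable[str], max_distance: int = 1) -> Dict[str, Set[str]]:
    """Map every delete-variant of each dictionary term back to the terms that produce it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for term in vocabulary:
        for variant in deletes(term, max_distance):
            index[variant].add(term)
    return index


def candidates(word: str, index: Dict[str, Set[str]], max_distance: int = 1) -> Set[str]:
    """Collect dictionary terms that share at least one delete-variant with the word."""
    found: Set[str] = set()
    for variant in deletes(word, max_distance):
        found |= index.get(variant, set())
    return found


def correct(
    word: str,
    context: str,
    index: Dict[str, Set[str]],
    score: Callable[[str, str], float],
    max_distance: int = 1,
) -> str:
    """Return the best-scoring candidate, or the word itself if it is known or has no candidates."""
    if word in index.get(word, ()):  # the word is already a dictionary term
        return word
    found = candidates(word, index, max_distance)
    if not found:
        return word
    # `score` is a placeholder for a language-model ranker (assumption, not the paper's code).
    # Sorting first makes tie-breaking deterministic.
    return max(sorted(found), key=lambda c: score(context, c))


if __name__ == "__main__":
    vocabulary = ["почка", "печка", "давление"]
    index = build_index(vocabulary)
    # Toy scorer: prefers the medical term regardless of context.
    toy_score = lambda context, candidate: 1.0 if candidate == "почка" else 0.0
    print(correct("пчка", "боль в области", index, toy_score))
```

Splitting the work this way keeps candidate generation cheap (set lookups over a precomputed index) and leaves the choice among several plausible candidates to the context-aware scorer.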
Acknowledgments
This work was supported by the Ministry of Science and Higher Education of the Russian Federation, state assignment (goszadanie) no. 2019-1339.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pogrebnoi, D., Funkner, A., Kovalchuk, S. (2023). RuMedSpellchecker: Correcting Spelling Errors for Natural Russian Language in Electronic Health Records Using Machine Learning Techniques. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 10475. Springer, Cham. https://doi.org/10.1007/978-3-031-36024-4_16
DOI: https://doi.org/10.1007/978-3-031-36024-4_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36023-7
Online ISBN: 978-3-031-36024-4
eBook Packages: Computer Science, Computer Science (R0)