DOI:10.1145/3555776.3578577
research-article

A Biomedical Entity Extraction Pipeline for Oncology Health Records in Portuguese

Published: 07 June 2023

ABSTRACT

Textual health records of cancer patients are usually lengthy and highly unstructured, making it very time-consuming for health professionals to get a complete overview of a patient's therapeutic course. Because such limitations can lead to suboptimal or inefficient treatment procedures, healthcare providers would greatly benefit from a system that effectively summarizes the information in those records. With the advent of deep neural models, this objective has been partially attained for English clinical texts; however, the research community still lacks an effective solution for languages with limited resources. In this paper, we present the approach we developed to extract procedures, drugs, and diseases from oncology health records written in European Portuguese. This project was conducted in collaboration with the Portuguese Institute for Oncology, which, besides holding over 10 years of duly protected medical records, also provided oncologist expertise throughout the development of the project. Since there is no annotated corpus for biomedical entity extraction in Portuguese, we also present the strategy we followed in annotating the corpus for the development of the models. The final models, which combined a neural architecture with entity linking, achieved F1 scores of 88.6%, 95.0%, and 55.8% in the mention extraction of procedures, drugs, and diseases, respectively.
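To make the reported metric concrete, the following is a minimal sketch of how a mention-level F1 score can be computed from gold and predicted entity spans. It assumes exact-match scoring (a prediction counts only if its start, end, and type all agree with a gold mention); the abstract does not state which matching criterion the authors used, and the `Mention` tuple layout, the function name `mention_f1`, and the toy spans are hypothetical, chosen only for illustration.

```python
from typing import Set, Tuple

# Hypothetical representation: (start_char, end_char, entity_type).
Mention = Tuple[int, int, str]


def mention_f1(gold: Set[Mention], predicted: Set[Mention]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match scoring (an assumption)."""
    # A predicted mention is a true positive only if it appears verbatim in gold.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, not taken from the paper.
gold = {(0, 12, "DRUG"), (20, 33, "PROCEDURE"), (40, 52, "DISEASE")}
predicted = {
    (0, 12, "DRUG"),        # exact match
    (20, 33, "PROCEDURE"),  # exact match
    (41, 52, "DISEASE"),    # boundary off by one: counted as an error here
    (60, 65, "DRUG"),       # spurious mention
}

precision, recall, f1 = mention_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under exact matching, a one-character boundary error costs both precision and recall. This is one reason scores for entity types with long or variable surface forms can be much lower than for others, although the abstract itself does not give a cause for the lower disease score.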


          • Published in

            SAC '23: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing
            March 2023
            1932 pages
            ISBN:9781450395175
            DOI:10.1145/3555776

            Copyright © 2023 ACM

            Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Acceptance Rates

            Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%