research-article

A Biomedical Entity Extraction Pipeline for Oncology Health Records in Portuguese

Authors:
Hugo Sousa
INESC TEC, Porto, Portugal; FCUP, Porto, Portugal
https://orcid.org/0000-0003-3226-9189

Alipio Mario Jorge
INESC TEC, Porto, Portugal; FCUP, Porto, Portugal
https://orcid.org/0000-0002-5475-1382

Arian Pasquali
INESC TEC, Porto, Portugal
https://orcid.org/0000-0002-3487-9397

Catarina Santos
INESC TEC, Porto, Portugal
https://orcid.org/0000-0002-9327-4486

Mario Lopes
INESC TEC, Porto, Portugal
https://orcid.org/0000-0001-9609-4723

SAC '23: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing
March 2023
Pages 950–956
https://doi.org/10.1145/3555776.3578577

Published: 07 June 2023

ABSTRACT

Textual health records of cancer patients are usually lengthy and highly unstructured, making it very time-consuming for health professionals to get a complete overview of a patient's therapeutic course. Because such limitations can lead to suboptimal or inefficient treatment procedures, healthcare providers would greatly benefit from a system that effectively summarizes the information in those records. With the advent of deep neural models, this objective has been partially attained for English clinical texts; however, the research community still lacks an effective solution for languages with limited resources. In this paper, we present the approach we developed to extract procedures, drugs, and diseases from oncology health records written in European Portuguese. This project was conducted in collaboration with the Portuguese Institute for Oncology, which, besides holding over 10 years of duly protected medical records, also provided oncologist expertise throughout the development of the project. Since there is no annotated corpus for biomedical entity extraction in Portuguese, we also present the strategy we followed to annotate the corpus used to develop the models. The final models, which combine a neural architecture with entity linking, achieved F₁ scores of 88.6%, 95.0%, and 55.8% in the mention extraction of procedures, drugs, and diseases, respectively.
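To make the reported metric concrete, the following is a minimal sketch of how a mention-level F₁ score can be computed. It assumes exact-match scoring over (start, end, label) spans; the abstract does not state the matching criterion used in the paper, and the function name `mention_f1`, the span representation, and the toy data are all hypothetical.

```python
from typing import Iterable, Set, Tuple

# Assumption: a mention is (start_offset, end_offset, label).
Mention = Tuple[int, int, str]


def mention_f1(
    gold: Iterable[Mention], predicted: Iterable[Mention]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-label matching."""
    gold_set: Set[Mention] = set(gold)
    pred_set: Set[Mention] = set(predicted)

    # A prediction counts only if offsets and label both match a gold mention.
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0

    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (invented for illustration, not taken from the paper).
gold = [(0, 9, "DRUG"), (15, 27, "PROCEDURE"), (40, 48, "DISEASE")]
predicted = [
    (0, 9, "DRUG"),
    (15, 27, "PROCEDURE"),
    (41, 48, "DISEASE"),  # boundary error: not an exact match
    (60, 66, "DRUG"),     # spurious mention
]

precision, recall, f1 = mention_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under exact matching, a single boundary error counts as both a false positive and a false negative, which is one reason scores for entity types with long or variable mentions can be markedly lower than for short, well-delimited ones.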

Index Terms

A Biomedical Entity Extraction Pipeline for Oncology Health Records in Portuguese
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Document structure
  2. Information systems applications
    1. Data mining
    2. Decision support systems

Published in
SAC '23: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing
March 2023
1932 pages
ISBN: 9781450395175
DOI: 10.1145/3555776
Conference Chairs:
Jiman Hong, Soongsil University, South Korea
Maart Lanperne, Tallinn University, Estonia
Program Chairs:
Juw Won Park, University of Louisville, USA
Tomas Cerny, Baylor University, USA
Publication Chair:
Hossain Shahriar, Kennesaw State University, USA
Copyright © 2023 ACM
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
biomedical entity recognition
data mining
oncology electronic health records
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%