Proposal and evaluation of FASDIM, a Fast And Simple De-Identification Method for unstructured free-text clinical records

https://doi.org/10.1016/j.ijmedinf.2013.11.005

Highlights

  • Existing de-identification methods rely on either natural language processing or pattern matching.

  • Those methods require pre-existing material in the appropriate language: dictionaries or de-identified documents, respectively.

  • FASDIM is a new method based on pattern matching that requires no pre-existing material: words are filtered by the operator on the fly.

  • FASDIM has been tested on 508 French discharge summaries and obtains: recall = 98.1% (no name remains), precision = 79.6%, F-measure = 87.9%.

  • The reports are encoded before and after de-identification: 99.0% of the codes (ICD10, ATC, CCAM for procedures) are preserved.

Abstract

Purpose

Medical free-text records provide rich information about patients, but often need to be de-identified by removing the Protected Health Information (PHI) whenever the identification of the patient is not required. Pattern matching techniques require predefined dictionaries, and machine learning techniques require an extensive training set. Methods exist in French, but either they bring weak results or they are not freely available. The objective is to define and evaluate FASDIM, a Fast And Simple De-Identification Method for French medical free-text records.

Methods

FASDIM consists of removing all the words that are not present in an authorized word list, and of removing all the numbers except those that match a list of protection patterns. The corresponding lists are extended over the course of the iterations of the method.

For the evaluation, the workload is estimated during the de-identification of the records. The efficiency of the de-identification is assessed by independent medical experts on 508 discharge letters that are randomly selected and de-identified by FASDIM. Finally, the letters are encoded before and after de-identification according to 3 terminologies (ATC, ICD10, CCAM) and the codes are compared.

Results

The construction of the list of authorized words is progressive: 12 h for the first 7000 letters, 16 additional hours for 20,000 additional letters. The recall (proportion of PHI that is removed) is 98.1%, the precision (proportion of removed tokens that are PHI) is 79.6% and the F-measure (their harmonic mean) is 87.9%. On average, 30.6 terminology codes are encoded per letter, and 99.02% of those codes are preserved despite the de-identification.

Conclusion

FASDIM achieves good results in French and is freely available. It is easy to implement and does not require any predefined dictionary.

Introduction

Computerized free-text medical records are important information sources for research. In most countries, each time a patient is discharged from a healthcare facility, a discharge letter has to be written: it summarizes all the pertinent information, from the reason for admission to the discharge drug treatment. Those letters are routinely produced and provide researchers with a large amount of medical information. On the other hand, confidentiality must be strictly respected: whenever a discharge letter is not used for the direct benefit of the patient and the patient does not need to be identified, the letter must be de-identified. Anonymization consists of removing the patients’ names from the records; unfortunately, other pieces of information can still identify the patients. De-identification is a more exhaustive removal of all Protected Health Information (PHI), so that the patients cannot be identified, either directly or indirectly. In the US, privacy rules have been enacted by the Department of Health and Human Services further to the Health Insurance Portability and Accountability Act of 1996 (HIPAA) [1]. In order to de-identify a high number of records, it is necessary to use automated methods, as manual methods require too high a workload [2].

Several methods exist for automated de-identification of free-text records [3], including procedures reports and discharge letters.

Pattern matching methods [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16] consist of applying rules that keep or remove words belonging to dictionaries predefined by experts or institutions. For instance, it is possible to remove all the words that belong to a list of town names, or to preserve all the words that belong to a list of medical terms (such as the Unified Medical Language System [17]). Additional rules may be used to take into account word declension and verb conjugation. This approach requires that such lists be available. When they exist, those lists are language-dependent and are suitable for a specific context only (e.g. town names or common family names are useless in another country).

Machine learning methods [14], [18], [19], [20], [21], [22], [23], [24], [25], [26] are derived from artificial intelligence. A learning phase requires a corpus of records that has previously been de-identified manually by experts. Those methods are often very efficient, depending on the quality and completeness of the learning corpus.

Whatever the method used, the de-identification is evaluated by computing three rates:

  • The recall (or sensitivity, or completeness; Eq. (1)), which is the proportion of PHI tokens that are removed. A high recall preserves confidentiality.

  • The precision (or positive predictive value, or correctness; Eq. (2)), which is the proportion of removed tokens that are PHI. A high precision preserves the readability of the text.

  • The F-measure, which is the harmonic mean of the recall and the precision (Eq. (3)).

recall = R = TP / #identifiers = TP / (TP + FN)  (1)

precision = P = TP / #removed = TP / (TP + FP)  (2)

F-measure = F = 2 / (1/R + 1/P)  (3)
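The three rates defined above can be computed with a short helper. The token counts in the example below are illustrative only, not taken from the paper:

```python
def deid_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute de-identification evaluation rates from token counts.

    tp: PHI tokens correctly removed
    fp: non-PHI tokens removed (over-scrubbing)
    fn: PHI tokens left in the text
    """
    recall = tp / (tp + fn)          # Eq. (1): TP / #identifiers
    precision = tp / (tp + fp)       # Eq. (2): TP / #removed
    f_measure = 2 / (1 / recall + 1 / precision)  # Eq. (3): harmonic mean
    return {"recall": recall, "precision": precision, "f_measure": f_measure}

# Illustrative counts chosen to mimic the order of magnitude of the results.
m = deid_metrics(tp=980, fp=251, fn=19)
```

Note that a high recall with a modest precision, as reported for FASDIM, corresponds to many false positives (over-scrubbed harmless words) but very few false negatives (leaked PHI).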

Table 1 presents the main results obtained in the literature for medical free-text de-identification. Most methods are developed for the English language and can hardly be used for other languages. Some methods have been developed in French, but either their results are disappointing or they are not freely available.

Despite the good results obtained by many methods, text de-identification is still not straightforward and some situations may not be addressed with current tools. We illustrate this through four use cases.

Case 1: a team has to de-identify English free-text records using pattern-matching. Some tools are freely available. However, it cannot be guaranteed that those tools could be applied in a different context without any adaptation. Indeed, pattern matching techniques rely on lists of words that are context-dependent: for instance “lime tree” should be removed in most reports as it is often part of a street name, but should not be removed in an allergy-related report. Lists of town names or family names also depend on the country. Finally, misspellings are most often not taken into account by existing methods.

Case 2: a team has to de-identify English free-text records using machine learning. Here again, some tools are freely available but, likewise, machine learning techniques require a pre-existing corpus of de-identified records. Such corpora are available in English [11], [36], [37], but they may be used only if the type of document to de-identify is the same as the documents of the training corpus.

Case 3: a team has to de-identify French free-text records (the problem is the same with most non-English languages): no free and efficient method, no list of words, and no training corpus are available. Everything has to be built.

Case 4: a team has only a little time (e.g. 1 person-week) to de-identify a moderate number of records (e.g. 25,000 records). Whatever the language, the context and the technique, understanding, adapting, implementing and executing an existing tool would probably take more time than that.

The conception of FASDIM relies on the idea that a simple de-identification technique could de-identify French discharge letters with an acceptable workload, particularly when the number of records is low. The main idea is to spread the workload over the course of the method, rather than requiring it before the first document can be de-identified.
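This "workload in the course of the method" idea can be sketched as an operator-in-the-loop routine: after each batch, words not yet on the authorized list are shown to the operator, who accepts the safe ones, so the list grows with the corpus instead of being built upfront. Function and variable names below are illustrative, not the paper's:

```python
def grow_authorized_list(batch_words, authorized, operator_accepts):
    """Present every unknown word to the operator; add accepted ones.

    batch_words: words encountered in the current batch of letters
    authorized: the set of words allowed to remain in the text
    operator_accepts: callback standing in for the human operator's decision
    """
    for word in sorted(set(batch_words) - authorized):
        if operator_accepts(word):
            authorized.add(word)
    return authorized

# Simulated operator that accepts common medical words and rejects names.
SAFE = {"pneumonia", "treatment"}
auth = grow_authorized_list(["pneumonia", "dupont", "treatment"],
                            set(), lambda w: w in SAFE)
```

The design choice is that rejecting a word by default is safe: an unknown word is scrubbed until an operator explicitly authorizes it, which favors recall over precision.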

The first general objective of this work is to design and implement FASDIM, a Fast And Simple De-Identification Method for clinical free-text records. The second general objective is to evaluate the method.

To reach the first general objective, the operational objectives are (1) to design a method that reaches good results in French using completely unstructured free-text records, but (2) is as independent as possible from the language structure (e.g. does not consider the declension of words or the conjugation of verbs), and (3) does not rely on any pre-existing material (list of words or corpus of de-identified documents), in order to (4) be easily and quickly reproducible from scratch by any hospital or research team.

To reach the second general objective, operational objectives are (5) to objectively compute traditional evaluation metrics but also (6) to evaluate the preservation of medical information and (7) to evaluate the workload required to implement the method.

The method is implemented and evaluated in French, but the examples given in this paper are translated into English.

Section snippets

Definition of FASDIM

FASDIM stands for Fast And Simple De-Identification Method. This method is composed of 3 steps (Fig. 1).
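A minimal sketch of the core filtering described in the abstract: words absent from the authorized list are scrubbed, and numbers are scrubbed unless they match a protection pattern. The word list, the protection pattern, and the "[...]" placeholder are illustrative assumptions, not the paper's actual material:

```python
import re

# Illustrative authorized word list and protection patterns.
AUTHORIZED = {"was", "admitted", "for", "pneumonia", "amoxicillin", "daily"}
# Example: protect dosages such as "500mg" while scrubbing bare numbers/dates.
PROTECTED_NUMBERS = [re.compile(r"\d+mg")]

def deidentify(text: str) -> str:
    """Scrub every token that is neither an authorized word nor a protected number."""
    out = []
    for tok in text.split():
        if any(ch.isdigit() for ch in tok):
            # keep the number only if a protection pattern matches it entirely
            kept = any(p.fullmatch(tok) for p in PROTECTED_NUMBERS)
            out.append(tok if kept else "[...]")
        else:
            out.append(tok if tok.lower() in AUTHORIZED else "[...]")
    return " ".join(out)

scrubbed = deidentify("Mr Dupont was admitted for pneumonia amoxicillin 500mg daily on 12/03")
```

In this sketch, "Mr", "Dupont", "on" and the date "12/03" are scrubbed, while the dosage "500mg" survives because it matches a protection pattern.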

Material and method of the evaluation

The FASDIM method was first developed to meet the needs of a research project with imposed deadlines, which explains why the numbers of records at each step are not regular. Seven successive sets of unstructured discharge letters are extracted from the hospital information system (HIS) of a French general hospital:

  • A first set of 20 records used to develop and test the method.

  • Successive cumulative sets of records: 7012 then 9503 then 16,009 then 17,812 then 23,493 letters.

  • Finally, from the last cumulated extraction of

First evaluation phase

The accuracy of the de-identification is evaluated using 508 discharge letters (Table 3). The recall is 98.1% [97.8%; 98.4%], the precision is 79.6% [78.9%; 80.3%] and the F-measure is 87.9%. Many auxiliary verbs are over-scrubbed because they stand near family names, but this does not alter the legibility of the text. If the suppression of auxiliary verbs is ignored, the precision reaches 89.2% and the F-measure reaches 93.4%.
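Intervals such as "98.1% [97.8%; 98.4%]" are consistent with a normal-approximation confidence interval for a proportion. The sketch below uses a hypothetical token count (n = 8000); the paper does not report the count in this excerpt, so the figure is an assumption for illustration only:

```python
import math

def proportion_ci(p_hat: float, n: int, z: float = 1.96) -> tuple:
    """Normal-approximation 95% confidence interval for a proportion.

    p_hat: observed proportion (e.g. recall)
    n: number of observations the proportion is based on (assumed here)
    z: standard normal quantile, 1.96 for a 95% interval
    """
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - half, p_hat + half)

# With an assumed n of 8000 evaluated PHI tokens, a recall of 98.1%
# yields an interval close to the one reported.
low, high = proportion_ci(0.981, 8000)
```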

On average, 0.36 PHI tokens are inappropriately preserved per letter

Discussion

In this work, FASDIM, a Fast And Simple De-Identification Method for clinical unstructured free-text records, has been defined and evaluated in French. The operational objectives defined in Section 1 have been reached (Table 6).

Objectives 1, 5 and 6: the method reaches very good results in French. The recall is 98.1%, the precision is 79.6% and the F-measure is 87.9%. A less strict evaluation gives a precision of 89.2% and an F-Measure of 93.4%. Moreover, 99.0% of the medical

Conclusion

FASDIM is a fast and simple algorithm that de-identifies French free-text discharge letters. It preserves patient confidentiality without threatening medical information. It seems especially suitable when a medium-sized corpus of letters has to be de-identified in a limited amount of time. Examples of source code and lists of words are freely available on the web [27]. The same method should be experimented with and evaluated on other types of texts, including less formal texts (such as

Role of the funding source

The study sponsor had no role in the study or the writing of the manuscript.

Author contributions

Emmanuel Chazard and Grégoire Ficheur have designed and implemented FASDIM. Capucine Mouret has performed the bibliographic analysis. Emmanuel Chazard and Capucine Mouret have designed the evaluation methodology. Capucine Mouret, Aurélien Schaffar and Jean-Baptiste Beuscart have performed the evaluation of FASDIM. The article has been written by Emmanuel Chazard and Capucine Mouret, and has been reviewed by all the authors, especially Régis Beuscart who is the department chair.

Conflicts of interest

The authors are employed by a public organization. There is no business exploitation or patent of the method presented and evaluated in this paper.

The evaluators and the developers of the system tested are not the same persons, do not belong to the same department, but are employed by the same hospital and frequently work together.

Summary points

What was already known:

  • Several methods exist for free-text de-identification.

  • Pattern matching methods require that dictionaries are already available.

Acknowledgment

The research leading to these results has received funding from the European Community's Seventh Framework Program (FP7/2007–2013) under Grant Agreement no. 216130 – the PSIP project.

References (37)

  • O. Uzuner et al.

    Evaluating the state-of-the-art in automatic de-identification

    J. Am. Med. Inform. Assoc.

    (2007)
  • Summary of the HIPAA Privacy Rule

    (2013)
  • D.A. Dorr et al.

    Assessing the difficulty and time cost of de-identification in clinical narratives

    Methods Inf. Med.

    (2006)
  • S.M. Meystre et al.

    Automatic de-identification of textual documents in the electronic health record: a review of recent research

    BMC Med. Res. Methodol.

    (2010)
  • B.A. Beckwith et al.

    Development and evaluation of an open source software tool for deidentification of pathology reports

    BMC Med. Inform. Decis. Mak.

    (2006)
  • J.J. Berman

    Concept-match medical data scrubbing. How pathology text can be used in research

    Arch. Pathol. Lab. Med.

    (2003)
  • E.M. Fielstein et al.

    Algorithmic De-identification of VA Medical Exam Text for HIPAA Privacy Compliance: Preliminary Findings; Medinfo

    (2004)
  • D. Gupta et al.

    Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research

    Am. J. Clin. Pathol.

    (2004)