Proposal and evaluation of FASDIM, a Fast And Simple De-Identification Method for unstructured free-text clinical records

https://doi.org/10.1016/j.ijmedinf.2013.11.005

Highlights

  • Existing de-identification methods rely on either natural language processing or pattern matching.

  • Those methods require pre-existing material in the appropriate language: dictionaries or de-identified documents, respectively.

  • FASDIM is a new method based on pattern matching that requires no pre-existing material: words are filtered by the operator on the fly.

  • FASDIM has been tested on 508 French discharge summaries and obtains: recall = 98.1% (no name remains), precision = 79.6%, F-measure = 87.9%.

  • The reports are encoded before and after de-identification: 99.0% of the codes (ICD10, ATC, CCAM for procedures) are preserved.

Abstract

Purpose

Medical free-text records provide rich information about patients, but often need to be de-identified by removing the Protected Health Information (PHI) whenever the identification of the patient is not required. Pattern matching techniques require predefined dictionaries, and machine learning techniques require an extensive training set. Methods exist in French, but either they bring weak results or they are not freely available. The objective is to define and evaluate FASDIM, a Fast And Simple De-Identification Method for French medical free-text records.

Methods

FASDIM consists of removing all the words that are not present in an authorized word list, and of removing all the numbers except those that match a list of protection patterns. The corresponding lists are extended over the course of the iterations of the method.

For the evaluation, the workload is estimated during the de-identification of the records. The efficiency of the de-identification is assessed by independent medical experts on 508 discharge letters that are randomly selected and de-identified by FASDIM. Finally, the letters are encoded before and after de-identification according to 3 terminologies (ATC, ICD10, CCAM) and the codes are compared.

Results

The construction of the list of authorized words is progressive: 12 h for the first 7000 letters, 16 additional hours for 20,000 additional letters. The recall (proportion of PHI that is removed) is 98.1%, the precision (proportion of removed tokens that are PHI) is 79.6% and the F-measure (their harmonic mean) is 87.9%. On average, 30.6 terminology codes are encoded per letter, and 99.02% of those codes are preserved despite the de-identification.

Conclusion

FASDIM achieves good results in French and is freely available. It is easy to implement and does not require any predefined dictionary.

Introduction

Computerized free-text medical records are important information sources for research. In most countries, each time a patient is discharged from a healthcare facility, a discharge letter has to be written: it summarizes all the pertinent information, from the reason for admission to the discharge drug treatment. Those letters are routinely produced and provide researchers with a large amount of medical information. On the other hand, confidentiality must be strictly respected: whenever a discharge letter is not used for the direct benefit of the patient and the patient does not need to be identified, the letter must be de-identified. Anonymization consists of removing the patients’ names from the records; unfortunately, other pieces of information can still identify the patients. De-identification is a more exhaustive removal of all Protected Health Information (PHI), so that the patients cannot be identified, either directly or indirectly. In the US, privacy rules have been enacted by the Department of Health and Human Services further to the Health Insurance Portability and Accountability Act of 1996 (HIPAA) [1]. In order to de-identify a high number of records, it is necessary to use automated methods, as manual methods require too high a workload [2].

Several methods exist for automated de-identification of free-text records [3], including procedures reports and discharge letters.

Pattern matching methods [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16] consist of applying rules that keep or remove words belonging to dictionaries predefined by experts or institutions. For instance, it is possible to remove all the words that belong to a list of town names, or to preserve all the words that belong to a list of medical terms (such as the Unified Medical Language System [17]). Additional rules may be used to take into account word declension and verb conjugation. This approach requires that such lists be available. When they exist, those lists are language-dependent and are suitable for a specific context only (e.g. town names or common family names are useless in another country).

Machine learning methods [14], [18], [19], [20], [21], [22], [23], [24], [25], [26] are derived from artificial intelligence. A learning phase requires a corpus of records that has previously been de-identified manually by experts. Those methods are often very efficient, depending on the quality and completeness of the learning corpus.

Whatever the method used, the de-identification is evaluated by computing three rates:

  • The recall (or sensitivity, or completeness; Eq. (1)), which is the proportion of PHI tokens that are removed. A high recall preserves confidentiality.

  • The precision (or positive predictive value, or correctness; Eq. (2)), which is the proportion of removed tokens that are PHI. A high precision preserves the readability of the text.

  • The F-measure, which is the harmonic mean of the recall and the precision (Eq. (3)).

recall = R = TP / #identifiers = TP / (TP + FN)  (1)

precision = P = TP / #removed = TP / (TP + FP)  (2)

F-measure = F = 2 / (1/R + 1/P)  (3)
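The three rates defined above can be computed with a short helper. The token counts in the example below are illustrative only, not taken from the paper:

```python
def deid_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute de-identification evaluation rates from token counts.

    tp: PHI tokens correctly removed
    fp: non-PHI tokens removed (over-scrubbing)
    fn: PHI tokens left in the text
    """
    recall = tp / (tp + fn)          # Eq. (1): TP / #identifiers
    precision = tp / (tp + fp)       # Eq. (2): TP / #removed
    f_measure = 2 / (1 / recall + 1 / precision)  # Eq. (3): harmonic mean
    return {"recall": recall, "precision": precision, "f_measure": f_measure}

# Illustrative counts chosen to mimic the order of magnitude of the results.
m = deid_metrics(tp=980, fp=251, fn=19)
```

Note that a high recall with a modest precision, as reported for FASDIM, corresponds to many false positives (over-scrubbed harmless words) but very few false negatives (leaked PHI).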

Table 1 presents the main results obtained in the literature for medical free-text de-identification. Most methods are developed for the English language and can hardly be used for other languages. Some methods have been developed in French, but either their results are disappointing or they are not freely available.

Despite the good results obtained by many methods, text de-identification is still not straightforward and some situations may not be addressed with current tools. We illustrate this through four use cases.

Case 1: a team has to de-identify English free-text records using pattern-matching. Some tools are freely available. However, it cannot be guaranteed that those tools could be applied in a different context without any adaptation. Indeed, pattern matching techniques rely on lists of words that are context-dependent: for instance “lime tree” should be removed in most reports as it is often part of a street name, but should not be removed in an allergy-related report. Lists of town names or family names also depend on the country. Finally, misspellings are most often not taken into account by existing methods.

Case 2: a team has to de-identify English free-text records using machine learning. Here again, some tools are freely available but, likewise, machine learning techniques require a pre-existing corpus of de-identified records. Such corpora are available in English [11], [36], [37], but they may be used only if the type of document to de-identify is the same as the documents of the training corpus.

Case 3: a team has to de-identify French free-text records (the problem is the same with most non-English languages): no free and efficient method, no list of words, and no training corpus are available. Everything has to be built.

Case 4: a team has only a little time (e.g. 1 person-week) to de-identify a moderate number of records (e.g. 25,000 records). Whatever the language, the context and the technique, understanding, adapting, implementing and executing an existing tool would probably take more time than that.

The conception of FASDIM relies on the idea that a simple de-identification technique could de-identify French discharge letters with an acceptable workload, particularly when the number of records is low. The main idea is to spread the workload over the course of the method, rather than requiring it before the first document can be de-identified.
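This "workload in the course of the method" idea can be sketched as an operator-in-the-loop routine: after each batch, words not yet on the authorized list are shown to the operator, who accepts the safe ones, so the list grows with the corpus instead of being built upfront. Function and variable names below are illustrative, not the paper's:

```python
def grow_authorized_list(batch_words, authorized, operator_accepts):
    """Present every unknown word to the operator; add accepted ones.

    batch_words: words encountered in the current batch of letters
    authorized: the set of words allowed to remain in the text
    operator_accepts: callback standing in for the human operator's decision
    """
    for word in sorted(set(batch_words) - authorized):
        if operator_accepts(word):
            authorized.add(word)
    return authorized

# Simulated operator that accepts common medical words and rejects names.
SAFE = {"pneumonia", "treatment"}
auth = grow_authorized_list(["pneumonia", "dupont", "treatment"],
                            set(), lambda w: w in SAFE)
```

The design choice is that rejecting a word by default is safe: an unknown word is scrubbed until an operator explicitly authorizes it, which favors recall over precision.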

The first general objective of this work is to design and implement FASDIM, a Fast And Simple De-Identification Method for clinical free-text records. The second general objective is to evaluate the method.

To reach the first general objective, the operational objectives are (1) to design a method that reaches good results in French using completely unstructured free-text records, but (2) is as independent as possible from the language structure (e.g. does not consider the declension of words or the conjugation of verbs), and (3) does not rely on any pre-existing material (list of words or corpus of de-identified documents), in order to (4) be easily and quickly reproducible from scratch by any hospital or research team.

To reach the second general objective, operational objectives are (5) to objectively compute traditional evaluation metrics but also (6) to evaluate the preservation of medical information and (7) to evaluate the workload required to implement the method.

The method is implemented and evaluated in French, but the examples given in this paper are translated into English.

Section snippets

Definition of FASDIM

FASDIM stands for Fast And Simple De-Identification Method. This method is composed of 3 steps (Fig. 1).
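A minimal sketch of the core filtering described in the abstract: words absent from the authorized list are scrubbed, and numbers are scrubbed unless they match a protection pattern. The word list, the protection pattern, and the "[...]" placeholder are illustrative assumptions, not the paper's actual material:

```python
import re

# Illustrative authorized word list and protection patterns.
AUTHORIZED = {"was", "admitted", "for", "pneumonia", "amoxicillin", "daily"}
# Example: protect dosages such as "500mg" while scrubbing bare numbers/dates.
PROTECTED_NUMBERS = [re.compile(r"\d+mg")]

def deidentify(text: str) -> str:
    """Scrub every token that is neither an authorized word nor a protected number."""
    out = []
    for tok in text.split():
        if any(ch.isdigit() for ch in tok):
            # keep the number only if a protection pattern matches it entirely
            kept = any(p.fullmatch(tok) for p in PROTECTED_NUMBERS)
            out.append(tok if kept else "[...]")
        else:
            out.append(tok if tok.lower() in AUTHORIZED else "[...]")
    return " ".join(out)

scrubbed = deidentify("Mr Dupont was admitted for pneumonia amoxicillin 500mg daily on 12/03")
```

In this sketch, "Mr", "Dupont", "on" and the date "12/03" are scrubbed, while the dosage "500mg" survives because it matches a protection pattern.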

Material and method of the evaluation

The FASDIM method was first developed to meet the needs of a research project with imposed deadlines, which explains why the numbers of records at each step are not regular. Seven successive sets of unstructured discharge letters are extracted from the hospital information system (HIS) of a French general hospital:

  • A first set of 20 records used to develop and test the method.

  • Successive cumulative sets of records: 7012 then 9503 then 16,009 then 17,812 then 23,493 letters.

  • Finally, from the last cumulated extraction of

First evaluation phase

The accuracy of the de-identification is evaluated using 508 discharge letters (Table 3). The recall is 98.1% [97.8%; 98.4%], the precision is 79.6% [78.9%; 80.3%] and the F-measure is 87.9%. Many auxiliary verbs are over-scrubbed because they stand near family names, but this does not alter the legibility of the text. If the suppression of auxiliary verbs is ignored, the precision reaches 89.2% and the F-measure reaches 93.4%.
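Intervals such as "98.1% [97.8%; 98.4%]" are consistent with a normal-approximation confidence interval for a proportion. The sketch below uses a hypothetical token count (n = 8000); the paper does not report the count in this excerpt, so the figure is an assumption for illustration only:

```python
import math

def proportion_ci(p_hat: float, n: int, z: float = 1.96) -> tuple:
    """Normal-approximation 95% confidence interval for a proportion.

    p_hat: observed proportion (e.g. recall)
    n: number of observations the proportion is based on (assumed here)
    z: standard normal quantile, 1.96 for a 95% interval
    """
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - half, p_hat + half)

# With an assumed n of 8000 evaluated PHI tokens, a recall of 98.1%
# yields an interval close to the one reported.
low, high = proportion_ci(0.981, 8000)
```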

On average, 0.36 PHI tokens are inappropriately preserved per letter

Discussion

In this work, FASDIM, a Fast And Simple De-Identification Method for clinical unstructured free-text records, has been defined and evaluated in French. The operational objectives defined in Section 1 have been reached (Table 6).

Objectives 1, 5 and 6: the method reaches very good results in French. The recall is 98.1%, the precision is 79.6% and the F-measure is 87.9%. A less strict evaluation gives a precision of 89.2% and an F-Measure of 93.4%. Moreover, 99.0% of the medical

Conclusion

FASDIM is a fast and simple algorithm that de-identifies French free-text discharge letters. It preserves patient confidentiality without threatening medical information. It seems especially suitable when a medium-sized corpus of letters has to be de-identified in a limited amount of time. Examples of source code and lists of words are freely available on the web [27]. The same method should be experimented with and evaluated on other types of texts, including less formal texts (such as

Role of the funding source

The study sponsor had no role in the study or the writing of the manuscript.

Author contributions

Emmanuel Chazard and Grégoire Ficheur have designed and implemented FASDIM. Capucine Mouret has performed the bibliographic analysis. Emmanuel Chazard and Capucine Mouret have designed the evaluation methodology. Capucine Mouret, Aurélien Schaffar and Jean-Baptiste Beuscart have performed the evaluation of FASDIM. The article has been written by Emmanuel Chazard and Capucine Mouret, and has been reviewed by all the authors, especially Régis Beuscart who is the department chair.

Conflicts of interest

The authors are employed by a public organization. There is no business exploitation or patent of the method presented and evaluated in this paper.

The evaluators and the developers of the system tested are not the same persons, do not belong to the same department, but are employed by the same hospital and frequently work together.

Summary points

What was already known:

  • Several methods exist for free-text de-identification.

  • Pattern matching methods require that dictionaries are already available.

Acknowledgment

The research leading to these results has received funding from the European Community's Seventh Framework Program (FP7/2007–2013) under Grant Agreement no. 216130 – the PSIP project.

References (37)

  • O. Uzuner et al.

    Evaluating the state-of-the-art in automatic de-identification

    J. Am. Med. Inform. Assoc.

    (2007)
  • Summary of the HIPAA Privacy Rule

    (2013)
  • D.A. Dorr et al.

    Assessing the difficulty and time cost of de-identification in clinical narratives

    Methods Inf. Med.

    (2006)
  • S.M. Meystre et al.

    Automatic de-identification of textual documents in the electronic health record: a review of recent research

    BMC Med. Res. Methodol.

    (2010)
  • B.A. Beckwith et al.

    Development and evaluation of an open source software tool for deidentification of pathology reports

    BMC Med. Inform. Decis. Mak.

    (2006)
  • J.J. Berman

    Concept-match medical data scrubbing. How pathology text can be used in research

    Arch. Pathol. Lab. Med.

    (2003)
  • E.M. Fielstein et al.

    Algorithmic De-identification of VA Medical Exam Text for HIPAA Privacy Compliance: Preliminary Findings; Medinfo

    (2004)
  • D. Gupta et al.

    Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research

    Am. J. Clin. Pathol.

    (2004)