ABSTRACT
Objective We developed a post-processing algorithm to convert raw natural language processing (NLP) output from electronic health records (EHRs) into a format usable for analysis. The algorithm was designed specifically to create datasets for medication-based studies.
Materials and Methods The algorithm was developed using output from two NLP systems, MedXN and medExtractR. We extracted medication information for two medications, tacrolimus and lamotrigine, from deidentified clinical notes in Vanderbilt’s EHR system. The algorithm consists of two parts. Part I parses the raw NLP output and links related entities. Part II removes redundancies and calculates dose intake and daily dose. We evaluated each part by comparing its output to human-determined gold standards, which were generated from approximately 300 records from 10 subjects for each medication and each NLP system.
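The Part II step can be illustrated with a minimal sketch. It is not the authors' implementation: the record fields, the function names, and the formulas (dose intake as strength times dose amount, daily dose as dose intake times intakes per day) are assumptions made for illustration, and the example values are hypothetical rather than taken from the study data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DoseRecord:
    """One linked set of entities, as Part I might produce (hypothetical fields)."""
    strength_mg: float    # strength of one unit, e.g. 1 mg per capsule
    dose_amount: float    # number of units taken at one time
    freq_per_day: float   # number of intakes per day


def remove_redundancies(records):
    """Drop exact duplicate records while keeping the original order."""
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique


def dose_intake(rec):
    """Amount taken at one time (assumed: strength x dose amount)."""
    return rec.strength_mg * rec.dose_amount


def daily_dose(rec):
    """Amount taken per day (assumed: dose intake x intakes per day)."""
    return dose_intake(rec) * rec.freq_per_day


# Hypothetical mentions: the first two are the same information repeated in a note.
records = [
    DoseRecord(strength_mg=1.0, dose_amount=2, freq_per_day=2),
    DoseRecord(strength_mg=1.0, dose_amount=2, freq_per_day=2),
    DoseRecord(strength_mg=0.5, dose_amount=1, freq_per_day=2),
]
unique_records = remove_redundancies(records)
daily_doses = [daily_dose(rec) for rec in unique_records]
```

Exact-match deduplication is the simplest possible redundancy rule; a real pipeline would likely also need to reconcile mentions that differ only in wording or that are missing some entities.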
Results The algorithm performed well against the gold standards. For MedXN, the F-measures were at or above 0.99 for Part I and at or above 0.97 for Part II. For medExtractR, the F-measures were 1.00 for Part I and at or above 0.98 for Part II.
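For readers unfamiliar with the metric, the sketch below shows how an F-measure is commonly computed, assuming the balanced form (F1, the harmonic mean of precision and recall). The abstract does not state which variant was used, and the counts in the example are hypothetical, not the study's results.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Balanced F-measure (F1) from counts of matches against a gold standard."""
    predicted = true_pos + false_pos   # items the algorithm produced
    actual = true_pos + false_neg      # items in the gold standard
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 99 correct outputs, 1 spurious output, 1 missed gold item.
example_f = f_measure(true_pos=99, false_pos=1, false_neg=1)
```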
Discussion Our post-processing algorithm was developed separately from any NLP system, making it easier to modify and to generalize to other systems. It performed well at converting NLP output into analyzable data, but it cannot perform well in certain cases, such as when the NLP system extracts incorrect information.
Conclusion Our post-processing algorithm provides a way to convert raw NLP output into a form that is useful for medication-based studies, creating more opportunities to use EHR data in diverse studies.