A study of EMR-based medical knowledge network and its applications

https://doi.org/10.1016/j.cmpb.2017.02.016

Highlights

  • We construct an EMR-based medical knowledge network (EMKN) to represent medical knowledge.

  • The network shows some complex properties, including small-world property, scale-free property and community structure.

  • Based on the network, we propose a simple but universal diagnosis model. This model takes symptom entities as inputs, and returns the disease entities as outputs.

  • This model only needs simple features as inputs, and is not restricted to a specific disease.

  • Our study results experimentally demonstrate the effectiveness of EMKN.

Abstract

Background and Objective

Electronic medical records (EMRs) contain a large amount of medical knowledge that can be used for clinical decision support. We attempt to integrate this medical knowledge into a complex network, and then implement a diagnosis model based on this network.

Methods

The dataset of our study contains 992 records, uniformly sampled from different departments of the hospital. To integrate the knowledge in these records, an EMR-based medical knowledge network (EMKN) is constructed. This network takes medical entities as nodes and co-occurrence relationships between pairs of entities as edges. Selected properties of this network are analyzed. To make use of this network, a basic diagnosis model is implemented. Seven hundred records are randomly selected to re-construct the network, and the remaining 292 records are used as test records. The vector space model is applied to represent the relationships between diseases and symptoms. Because a record may contain more than one actual disease, the recall rate of the first ten results and the average precision are adopted as evaluation measures.

Results

Compared with a random network of the same size, this network has a similar average path length but a much higher clustering coefficient. Additionally, the community structure correlates directly with the real department classes in the hospital. For the diagnosis model, the vector space model using disease as a base obtains the best result. In 73.27% of the test records, at least one correct disease appears among the first ten results.

Conclusion

We constructed an EMR-based medical knowledge network by extracting the medical entities. This network has the small-world and scale-free properties. Moreover, the community structure showed that entities in the same department have a tendency to be self-aggregated. Based on this network, a diagnosis model was proposed. This model uses only the symptoms as inputs and is not restricted to a specific disease. The experiments conducted demonstrated that EMKN is a simple and universal technique to integrate different medical knowledge from EMRs, and can be used for clinical decision support.

Introduction

The electronic medical record (EMR) is the storage of all health care data and medical history of a patient in an electronic format [17]. These data include abundant medical knowledge, such as the current clinical diagnosis, medical history, results of investigations, treatment plans, and so on [12]. This is a novel and rich resource for clinical research [26], [44]. As the quantity of EMRs increases rapidly, medical professionals are overwhelmed by this ever-expanding knowledge. Therefore, several effective methods are being developed to assist these professionals [47].

Entities, and relationships between entities, are the primary carriers of medical knowledge in EMRs, and can be extracted by a natural language processing (NLP) technique [44]. Thus, entity recognition [25] and entity relationship extraction become the key tasks in the knowledge extraction of EMRs [29], [47]. Additionally, these tasks are an application for NLP in biomedical informatics [38], and have become the basis of many other tasks [28], [37].

If we regard the extracted entities as nodes, and relationships between two entities as edges, the knowledge derived from EMRs can be represented as a network. Network-based methods have been extensively used for medical knowledge representation and inference. These networks can be divided into two major groups: Bayesian networks (or, more generally, probabilistic graphical models) and complex networks.

Many existing studies have attempted to diagnose or predict diseases based on Bayesian networks [9], [13], [14], [19], [27], [42]. One of the earliest and most renowned diagnosis systems is Pathfinder, a decision-theoretic expert system for hematopathology diagnosis [19]. Since that time, Bayesian networks have played important roles in successive studies. Klann et al. [27] implemented an adaptive recommendation system to recommend a next order of treatment menu, based on the previous orders. Velikova et al. [42] built a probabilistic disease model for preeclampsia using a temporal Bayesian network and implemented it as part of a real-world home-monitoring system for personalized pregnancy care. Flores et al. [13] presented a methodology for incorporating expert knowledge as structural priors for learning Bayesian networks, and applied it to the study of heart failure. As a probabilistic inference paradigm, Bayesian networks are suitable for modeling complex interactions between medical entities, and in some cases achieve even higher accuracy than clinicians. However, the computational complexity of structure and parameter learning for Bayesian networks can become intractable as the number of nodes grows [7]. Networks with hundreds of nodes are difficult for Bayesian methods to handle; thus, such a network can typically be used only in a single, specific field.

In contrast, complex-network-based models can handle large-scale data, although they are not specialized models for knowledge representation and inference. Complex networks refer to large-scale networks that have small-world and scale-free properties [2], [45]. Compared with random networks, in which any two nodes are independently connected with a fixed probability [11], [41], complex networks have a small number of nodes with a much larger number of connections. Studies of network systems have pointed out that many real networks are complex networks rather than random networks [8], [10].
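The contrast with random networks can be made concrete with a small sketch. The code below generates a random graph in which each node pair is linked independently with probability p, and computes the average clustering coefficient, one of the quantities commonly used to check the small-world property. The function names and the toy graphs are illustrative assumptions, not part of the original study, and a real analysis of a network of this size would likely use a dedicated graph library.

```python
import random
from itertools import combinations


def random_graph(n, p, seed=0):
    """Random graph: each node pair is linked independently with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with fewer than two neighbours count as 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count links that exist among the neighbours of this node.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj) if adj else 0.0


def max_degree(adj):
    """Largest number of connections held by any single node."""
    return max((len(nbrs) for nbrs in adj.values()), default=0)


# Hypothetical toy graphs: a triangle is fully clustered, a star is a pure hub.
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
baseline = random_graph(50, 0.1)
```

Comparing `average_clustering` of an observed network against that of `random_graph` with the same number of nodes and a matching edge density gives a rough version of the comparison described above, while `max_degree` hints at the presence of hub nodes.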

In the medical domain, the complex network plays an instrumental role by providing conceptual insights, as well as offering visual and computational methodology [15]. It is widely applied to disease-gene [16], protein-protein [43], and disease-symptom [51] interaction analysis, as well as to the fields of pharmacology [49], [50], epidemiology [20], genetics [30] and brain science [24]. Goh et al. [16] proposed the diseasome, a bipartite graph consisting of disease and gene nodes. A disease and a gene were connected if mutations in that gene were implicated in that disease. Zhou et al. [51] constructed a symptom-based human disease network from biomedical literature databases, and investigated the connection between clinical manifestations of diseases and their underlying molecular interactions. Tachimori et al. [39] constructed a medical network from a medical textbook. This network had small-world and scale-free features, but was not shown to support knowledge inference. To our knowledge, medical networks derived from EMRs have not yet been reported.

We believe such complex networks can support clinical decisions. Applying artificial intelligence (AI) approaches to disease diagnostics is receiving significant attention in the field of medical informatics. Typically, existing studies of AI diagnosis focus on one specific disease. Alizadehsani et al. [1] selected four combinations of related features for coronary artery disease. They then applied sequential minimal optimization and other algorithms to these groups of features, and reached a best accuracy of 94.08%. Rau et al. [35] proposed a prediction model for the development of liver cancer in type 2 diabetes mellitus patients. They identified 10 risk factors as variables, then constructed artificial neural network (ANN) and logistic regression (LR) prediction models. The best sensitivity and specificity were 0.757 and 0.755, respectively. Hariharana et al. [18] selected 22 raw features from the voice signals of people diagnosed with Parkinson’s disease. Their diagnosis model consisted of feature pre-processing, feature reduction/selection, and different feature classifiers, using a support vector machine (SVM) and neural network. They obtained a classification accuracy of 100% for the test set.

In these studies, the relevant patient features were manually selected, and the diagnostic problems were then converted into classification tasks. After feature selection and reduction, classifiers were used to determine whether the patients had the disease. A classifier for a specific disease can achieve results good enough to support a clinical decision. However, this type of study focuses mostly on feature extraction and feature pre-processing. Good features usually yield satisfactory performance, while bad features cannot accurately distinguish positive cases from negative ones, and good feature selection requires professional medical knowledge. Moreover, each model can only be applied to a certain disease. If people have certain symptoms and want to know what affliction they might have, these systems are not of much help. We explore the possibility of constructing a more universal diagnosis model, which uses simple features (such as symptoms) as inputs and is not restricted to a specific disease.

In this paper, we construct EMKN, a new Medical Knowledge Network based on EMRs. Nodes of this network are medical entities and edges are co-occurrence relationships between entities in the same records. The more frequently an entity pair occurs together, the larger the edge weight between this pair becomes. We calculate primary quantities of EMKN and validate its complex properties. Next, we propose a basic, universal diagnosis model using EMKN to show that EMKN can indeed express medical knowledge. We apply this model to real EMRs and prove its effectiveness.
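The construction described above can be illustrated with a minimal sketch. It assumes that each record has already been reduced to a list of annotated entity strings and that the edge weight is simply the number of records in which a pair co-occurs; the exact weighting used for EMKN is not given in this excerpt, so both the weighting and the toy records are assumptions.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(records):
    """Return {(entity_a, entity_b): weight}; weight counts records containing both entities."""
    weights = Counter()
    for entities in records:
        # Deduplicate and sort so each record adds at most 1 to each unordered pair.
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical toy records; each list holds the entities annotated in one EMR.
records = [
    ["cough", "fever", "pneumonia"],
    ["cough", "fever", "influenza"],
    ["chest pain", "coronary heart disease"],
]
network = build_cooccurrence_network(records)
```

Here the pair ("cough", "fever") appears in two records and therefore receives the largest weight, which matches the idea that more frequent co-occurrence produces a heavier edge.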

This paper makes two main contributions:

  • We construct EMKN, an EMR-based medical knowledge network, to represent medical knowledge. Then, we find the complex properties of this network.

  • We propose a simple, but universal diagnosis model based on the network. This model takes symptom entities as inputs, and can be used for multi-disease diagnosis.

The remainder of this paper is structured as follows. In Section 2, we give a brief introduction to our EMR corpus, then provide the construction approach and property analysis of EMKN. In Section 3, we describe the details of the diagnosis model. Then, we evaluate our model using actual records in Section 4, and give the results with discussion. Conclusion and future studies will be presented in Section 5.

Section snippets

Medical entities and assertions

Referencing the medical concept annotation guideline and assertion annotation guideline given by Informatics for Integrating Biology and the Bedside (i2b2) [21], [22], we have created guidelines for Chinese EMRs, and manually annotated 992 records as the corpus under the guidance of medical professionals [46]. These records were retrieved from The Second Affiliated Hospital of Harbin Medical University, and contained 887 individual patients. We have obtained the usage rights for these records.

Disease inference

In medical diagnosis, we are typically faced with a set of symptoms and test results. Then, we attempt to find the cause behind them. To make use of the medical knowledge of EMKN, we designed a diagnosis model, which can infer the corresponding diseases with the given symptoms.
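One way to picture such a model is to represent every symptom as a vector of co-occurrence weights over the diseases and to score each disease by summing the vectors of the input symptoms. The sketch below follows that idea; it is an illustrative assumption, not the authors' exact formulation, which is not reproduced in this excerpt. The length normalization, the function names and the toy training pairs are all hypothetical.

```python
from collections import defaultdict


def build_symptom_vectors(records):
    """Map each symptom to {disease: co-occurrence count} over the training records."""
    vectors = defaultdict(lambda: defaultdict(int))
    for symptoms, diseases in records:
        for s in set(symptoms):
            for d in set(diseases):
                vectors[s][d] += 1
    return vectors


def rank_diseases(vectors, symptoms, top_k=10):
    """Sum the length-normalized vectors of the input symptoms; return the top_k diseases."""
    scores = defaultdict(float)
    for s in set(symptoms):
        vec = vectors.get(s)
        if not vec:
            continue  # symptom never seen in the training records
        norm = sum(w * w for w in vec.values()) ** 0.5
        for d, w in vec.items():
            scores[d] += w / norm
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [d for d, _ in ranked[:top_k]]


# Hypothetical training pairs: (symptoms with positive assertion, diagnosed diseases).
train = [
    (["cough", "fever"], ["pneumonia"]),
    (["cough", "fever"], ["influenza"]),
    (["fever"], ["influenza"]),
    (["chest pain"], ["coronary heart disease"]),
]
vectors = build_symptom_vectors(train)
candidates = rank_diseases(vectors, ["cough", "fever"])
```

Because the output is a ranked list rather than a single label, a sketch of this kind is not tied to one disease and can return several candidate diagnoses for the same set of symptoms.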

We think symptoms are caused by diseases. If we view the symptoms as vectors based on the diseases, we can describe the causality to some degree. Suppose there are N distinct disease entities. Then, the symptom vector

Experiment setup

To evaluate our model, we randomly selected 700 records from our corpus as the training set, and 292 records remained as the test set. We re-constructed the SDEMKN using the new training data. In the process of entity extraction, we added simple rules to unify parts of entities which had the same meanings. For a test record, we took symptoms with positive assertions as input to infer the probable diagnosis result. Diseases in the record with non-absent assertion were regarded as the actual

Conclusion

We constructed EMKN, an EMR-based Medical Knowledge Network, using a manually annotated EMR corpus. This network took medical entities as nodes and the co-occurrence relationships of entities in the same record as edges. We obtained main quantities of EMKN and validated its small-world and scale-free properties. In addition, we illustrated that the community structure of this network was associated with the real department information.

According to the subgraph SDEMKN, we represented the

Acknowledgment

This work is supported by the Natural Science Foundation of China (No. 61672185). We thank the Second Affiliated Hospital of Harbin Medical University for providing the corpus used in this study. We also thank the anonymous reviewers for their comments, which provided us with significant guidance.

References (51)

  • M. Pettorruso et al.

    Punding in non-demented Parkinson’s disease patients: relationship with psychiatric and addiction spectrum comorbidity

    J. Neurol. Sci.

    (2016)
  • R.A.A. Seoud et al.

    TMT-HCC: a tool for text mining the biomedical literature for hepatocellular carcinoma (HCC) biomarkers identification

    Comput. Methods Programs Biomed.

    (2013)
  • Y. Tachimori et al.

    The networks from medical knowledge and clinical practice have small-world, scale-free, and hierarchical features

    Physica A

    (2013)
  • M. Velikova et al.

    Exploiting causal functional relationships in Bayesian network modelling for personalised healthcare

    Int. J. Approximate Reasoning

    (2014)
  • R.C. Wasserman

    Electronic medical records (EMRs), epidemiology, and epistemology: reflections on EMRs and future pediatric clinical research

    Acad. Pediatr.

    (2011)
  • W. Zhang et al.

    A comparative study of TF*IDF, LSI and multi-words for text classification

    Expert Syst. Appl.

    (2011)
  • X. Zhou et al.

    Key technology based on network pharmacology of complex networks

    2016 IEEE International Conference on Big Data Analysis (ICBDA)

    (2016)
  • A.-L. Barabási et al.

    Emergence of scaling in random networks

    Science

    (1999)
  • M. Bastian et al.

    Gephi: an open source software for exploring and manipulating networks

    ICWSM

    (2009)
  • Y. Bengio et al.

    A neural probabilistic language model

    J. Mach. Learn. Res.

    (2003)
  • V.D. Blondel et al.

    Fast unfolding of communities in large networks

    J. Stat. Mech.

    (2008)
  • C. Buckley et al.

    Evaluating evaluation measure stability

    Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

    (2000)
  • D.M. Chickering et al.

    Large-sample learning of Bayesian networks is NP-hard

    J. Mach. Learn. Res.

    (2004)
  • S.N. Dorogovtsev et al.

    Critical phenomena in complex networks

    Rev. Mod. Phys.

    (2008)
  • P. Erdős et al.

    On random graphs I

    Publicationes Mathematicae

    (1959)