A study of EMR-based medical knowledge network and its applications
Introduction
The electronic medical record (EMR) is the storage of all health care data and medical history of a patient in an electronic format [17]. These data include abundant medical knowledge, such as the current clinical diagnosis, medical history, results of investigations, treatment plans, and so on [12]. This is a novel and rich resource for clinical research [26], [44]. As the quantity of EMRs increases rapidly, medical professionals are overwhelmed by this ever-expanding knowledge. Therefore, several effective methods are being developed to assist these professionals [47].
Entities, and relationships between entities, are the primary carriers of medical knowledge in EMRs, and can be extracted by a natural language processing (NLP) technique [44]. Thus, entity recognition [25] and entity relationship extraction become the key tasks in the knowledge extraction of EMRs [29], [47]. Additionally, these tasks are an application for NLP in biomedical informatics [38], and have become the basis of many other tasks [28], [37].
If we regard the extracted entities as nodes, and the relationships between pairs of entities as edges, the knowledge derived from EMRs can be represented as a network. Network-based methods have been extensively used for medical knowledge representation and inference. These networks can be divided into two major groups: Bayesian networks (or, more generally, probabilistic graphical models) and complex networks.
Many existing studies have attempted to diagnose or predict diseases based on Bayesian networks [9], [13], [14], [19], [27], [42]. One of the earliest and most renowned diagnosis systems is Pathfinder, a decision-theoretic expert system for hematopathology diagnosis [19]. Since then, Bayesian networks have played important roles in successive studies. Klann et al. [27] implemented an adaptive recommendation system that recommends a menu of next treatment orders based on the previous orders. Velikova et al. [42] built a probabilistic disease model for preeclampsia using a temporal Bayesian network and implemented it as part of a real-world home-monitoring system for personalized pregnancy care. Flores et al. [13] presented a methodology for incorporating expert knowledge as structural priors for learning Bayesian networks, and applied it to the study of heart failure. As a probabilistic inference paradigm, Bayesian networks are well suited to modeling complex interactions between medical entities, and in some cases achieve higher accuracy than clinicians. However, the computational complexity of structure and parameter learning for Bayesian networks can become intractable as the number of nodes grows [7]. Networks with hundreds of nodes are difficult for this approach to handle, so such a network can only be used in a single, specific field.
In contrast, complex-network-based models can analyze large-scale data, although they are not specialized for knowledge representation and inference. Complex networks are large-scale networks with small-world and scale-free properties [2], [45]. Compared with random networks, in which any two nodes are independently connected with a fixed probability [11], [41], complex networks have a small number of hub nodes with far more connections than the rest. Studies of network systems have pointed out that many real networks are complex networks rather than random networks [8], [10].
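The contrast between the two network types can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: it uses preferential attachment as a stand-in generator for a scale-free network, and the function names, the sizes `n = 1000` and `m = 2`, and the random seed are all hypothetical choices, not values from this study.

```python
import random

def random_graph_degrees(n, p, rng):
    """Degrees of a random graph: each pair is linked independently with probability p."""
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                degree[i] += 1
                degree[j] += 1
    return degree

def preferential_attachment_degrees(n, m, rng):
    """Degrees of a graph grown by preferential attachment (assumed generator)."""
    degree = [0] * n
    endpoints = []  # every node appears once per incident edge
    # Seed the growth with a small clique of m + 1 nodes.
    for i in range(m + 1):
        for j in range(i + 1, m + 1):
            degree[i] += 1
            degree[j] += 1
            endpoints += [i, j]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            # Sampling from the endpoint list picks nodes proportionally to degree.
            targets.add(rng.choice(endpoints))
        for t in targets:
            degree[new] += 1
            degree[t] += 1
            endpoints += [new, t]
    return degree

rng = random.Random(0)
n, m = 1000, 2
# Choose p so that both graphs have a mean degree close to 4.
er_degrees = random_graph_degrees(n, 2 * m / (n - 1), rng)
ba_degrees = preferential_attachment_degrees(n, m, rng)
print("largest degree, random graph:", max(er_degrees))
print("largest degree, preferential attachment:", max(ba_degrees))
```

Both graphs have nearly the same average degree, yet the preferential-attachment graph develops a few heavily connected hubs, which is the qualitative signature described above.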
In the medical domain, the complex network plays an instrumental role by providing conceptual insights, as well as offering visual and computational methodology [15]. It is widely applied to disease-gene [16], protein-protein [43], and disease-symptom [51] interaction analysis, as well as to the fields of pharmacology [49], [50], epidemiology [20], genetics [30], and brain science [24]. Goh et al. [16] proposed the diseasome, a bipartite graph consisting of disease and gene nodes. A disease and a gene were connected if mutations in that gene were implicated in that disease. Zhou et al. [51] constructed a symptom-based human disease network from biomedical literature databases, and investigated the connection between clinical manifestations of diseases and their underlying molecular interactions. Tachimori et al. [39] constructed a medical network from a medical textbook. This network had small-world and scale-free features, but did not show a knowledge inference ability. To our knowledge, medical networks derived from EMRs have not yet been reported.
We believe such complex networks can support clinical decisions. Applying artificial intelligence (AI) approaches to disease diagnostics is receiving significant attention in the field of medical informatics. Typically, existing studies of AI diagnosis focus on one specific disease. Alizadehsani et al. [1] selected four combinations of related features for coronary artery disease. Then, they applied sequential minimal optimization and other algorithms to these groups of features and reached a best accuracy of 94.08%. Rau et al. [35] proposed a model for predicting the development of liver cancer in type 2 diabetes mellitus patients. They identified 10 risk factors as variables, then constructed artificial neural network (ANN) and logistic regression (LR) prediction models. The best sensitivity and specificity were 0.757 and 0.755, respectively. Hariharana et al. [18] selected 22 raw features from the voice signals of people diagnosed with Parkinson’s disease. Their diagnosis model consisted of feature pre-processing, feature reduction/selection, and different feature classifiers, using a support vector machine (SVM) and a neural network. They obtained a classification accuracy of 100% on the test set.
In these studies, the relevant patient features were manually selected, and the diagnostic problems were then converted into classification tasks. After feature selection and reduction, classifiers were used to decide whether the patients had the disease. A classifier for a specific disease can achieve results good enough to be used in clinical practice and to support clinical decisions. However, this type of study focuses more on feature extraction and feature pre-processing. Good features usually yield satisfactory performance, while poor features cannot accurately distinguish positive cases from negative ones. Good feature selection requires professional medical knowledge. Moreover, the model can only be applied to a certain disease. If people have certain symptoms and want to know what affliction they might have, these systems are not of much help. We explore the possibility of constructing a more universal diagnosis model that uses simple features (such as symptoms) as inputs and is not restricted to a specific disease.
In this paper, we construct EMKN, a new Medical Knowledge Network based on EMRs. Nodes of this network are medical entities and edges are co-occurrence relationships between entities in the same records. The more frequently an entity pair occurs together, the larger the edge weight between this pair becomes. We calculate primary quantities of EMKN and validate its complex properties. Next, we propose a basic, universal diagnosis model using EMKN to show that EMKN can indeed express medical knowledge. We apply this model to real EMRs and prove its effectiveness.
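The construction described above can be sketched in a few lines. This is a minimal sketch, assuming that the edge weight is simply the number of records in which a pair of entities appears together; the function name `build_cooccurrence_network` and the toy records are hypothetical and are not taken from the corpus.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_network(records):
    """Return {(entity_a, entity_b): weight} for entities sharing a record."""
    weights = Counter()
    for entities in records:
        # Deduplicate and sort so each unordered pair has one canonical key
        # and is counted at most once per record.
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            weights[(a, b)] += 1
    return weights

# Hypothetical toy records; each list holds the entities found in one EMR.
records = [
    ["cough", "fever", "pneumonia"],
    ["cough", "fever", "influenza"],
    ["fever", "pneumonia"],
]
network = build_cooccurrence_network(records)
for (a, b), weight in sorted(network.items()):
    print(a, "-", b, "weight", weight)
```

Storing each edge under a sorted pair keeps the network undirected, so a pair is never counted twice under swapped keys.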
This paper makes two main contributions:
- We construct EMKN, an EMR-based medical knowledge network, to represent medical knowledge. Then, we find the complex properties of this network.
- We propose a simple, but universal diagnosis model based on the network. This model takes symptom entities as inputs, and can be used for multi-disease diagnosis.
The remainder of this paper is structured as follows. In Section 2, we give a brief introduction to our EMR corpus, then provide the construction approach and property analysis of EMKN. In Section 3, we describe the details of the diagnosis model. Then, we evaluate our model using actual records in Section 4, and give the results with discussion. Conclusion and future studies will be presented in Section 5.
Section snippets
Medical entities and assertions
Following the medical concept annotation guideline and the assertion annotation guideline given by Informatics for Integrating Biology and the Bedside (i2b2) [21], [22], we created guidelines for Chinese EMRs, and manually annotated 992 records as the corpus under the guidance of medical professionals [46]. These records were retrieved from The Second Affiliated Hospital of Harbin Medical University, and covered 887 individual patients. We have obtained the usage rights for these records.
Disease inference
In medical diagnosis, we are typically faced with a set of symptoms and test results. Then, we attempt to find the cause behind them. To make use of the medical knowledge of EMKN, we designed a diagnosis model, which can infer the corresponding diseases with the given symptoms.
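As a rough illustration of inferring diseases from symptoms over a weighted network, the sketch below ranks diseases by the total weight of the edges that link them to the observed symptoms. This is a deliberately simple baseline and is not the model proposed in this paper; the function name `rank_diseases`, the edge weights, and the entity names are all hypothetical.

```python
from collections import defaultdict

def rank_diseases(symptoms, edges):
    """Rank diseases by the total edge weight linking them to the given symptoms."""
    scores = defaultdict(float)
    for (symptom, disease), weight in edges.items():
        if symptom in symptoms:
            scores[disease] += weight
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical symptom-disease edge weights (for example, co-occurrence counts).
edges = {
    ("cough", "pneumonia"): 3,
    ("fever", "pneumonia"): 4,
    ("fever", "influenza"): 5,
    ("chest pain", "angina"): 6,
}
ranking = rank_diseases({"cough", "fever"}, edges)
for disease, score in ranking:
    print(disease, score)
```

Because the output is a ranked list rather than a yes/no label for one disease, the same routine can return several candidate diseases for one set of symptoms.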
We think symptoms are caused by diseases. If we view the symptoms as vectors based on the diseases, we can describe the causality to some degree. Suppose there are N distinct disease entities. Then, the symptom vector
Experiment setup
To evaluate our model, we randomly selected 700 records from our corpus as the training set, and kept the remaining 292 records as the test set. We re-constructed the SDEMKN using the new training data. In the process of entity extraction, we added simple rules to unify entities that had the same meaning. For a test record, we took symptoms with positive assertions as input to infer the probable diagnosis result. Diseases in the record with non-absent assertion were regarded as the actual
Conclusion
We constructed EMKN, an EMR-based Medical Knowledge Network, using a manually annotated EMR corpus. This network took medical entities as nodes and the co-occurrence relationships of entities in the same record as edges. We obtained main quantities of EMKN and validated its small-world and scale-free properties. In addition, we illustrated that the community structure of this network was associated with the real department information.
According to the subgraph SDEMKN, we represented the
Acknowledgment
This work is supported by the Natural Science Foundation of China (No. 61672185). We thank the Second Affiliated Hospital of Harbin Medical University for providing the corpus used in this study. We also thank the anonymous reviewers for their comments, which provided us with significant guidance.
References (51)
- et al. A data mining approach for diagnosis of coronary artery disease. Comput. Methods Programs Biomed. (2013)
- et al. Approaching human language with complex networks. Phys. Life Rev. (2014)
- et al. From complex questionnaire and interviewing data to intelligent Bayesian network models for medical decision support. Artif. Intell. Med. (2016)
- et al. Bayesian network modeling: a case study of an epidemiologic system analysis of cardiovascular risk. Comput. Methods Programs Biomed. (2016)
- et al. A new hybrid intelligent system for accurate detection of Parkinson’s disease. Comput. Methods Programs Biomed. (2014)
- et al. Cancer-disease associations: a visualization and animation through medical big data. Comput. Methods Programs Biomed. (2016)
- et al. Enhancing medical named entity recognition with an extended segment representation technique. Comput. Methods Programs Biomed. (2015)
- et al. Decision support from local data: creating adaptive order menus from past clinician behavior. J. Biomed. Inf. (2014)
- et al. BioAnnote: a software platform for annotating biomedical documents with application in medical learning environments. Comput. Methods Programs Biomed. (2013)
- et al. Temporal data representation, normalization, extraction, and reasoning: a review from clinical domain. Comput. Methods Programs Biomed. (2016)
- Punding in non-demented Parkinson’s disease patients: relationship with psychiatric and addiction spectrum comorbidity. J. Neurol. Sci.
- TMT-HCC: a tool for text mining the biomedical literature for hepatocellular carcinoma (HCC) biomarkers identification. Comput. Methods Programs Biomed.
- The networks from medical knowledge and clinical practice have small-world, scale-free, and hierarchical features. Physica A
- Exploiting causal functional relationships in Bayesian network modelling for personalised healthcare. Int. J. Approximate Reasoning
- Electronic medical records (EMRs), epidemiology, and epistemology: reflections on EMRs and future pediatric clinical research. Acad. Pediatr.
- A comparative study of TF*IDF, LSI and multi-words for text classification. Expert Syst. Appl.
- Key technology based on network pharmacology of complex networks. 2016 IEEE International Conference on Big Data Analysis (ICBDA)
- Emergence of scaling in random networks. Science
- Gephi: an open source software for exploring and manipulating networks. ICWSM
- A neural probabilistic language model. J. Mach. Learn. Res.
- Fast unfolding of communities in large networks. J. Stat. Mech.
- Evaluating evaluation measure stability. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
- Large-sample learning of Bayesian networks is NP-hard. J. Mach. Learn. Res.
- Critical phenomena in complex networks. Rev. Mod. Phys.
- On random graphs I. Publicationes Mathematicae