A study of EMR-based medical knowledge network and its applications

doi:10.1016/j.cmpb.2017.02.016

Computer Methods and Programs in Biomedicine

Volume 143, May 2017, Pages 13-23

https://doi.org/10.1016/j.cmpb.2017.02.016

Highlights

• We construct an EMR-based medical knowledge network (EMKN) to represent medical knowledge.
• The network shows some complex properties, including small-world property, scale-free property and community structure.
• Based on the network, we propose a simple but universal diagnosis model. This model takes symptom entities as inputs, and returns the disease entities as outputs.
• This model only needs simple features as inputs, and is not restricted to a specific disease.
• Our study results experimentally demonstrate the effectiveness of EMKN.

Abstract

Background and Objective

Electronic medical records (EMRs) contain a large amount of medical knowledge that can be used for clinical decision support. We attempt to integrate this knowledge into a complex network, and then implement a diagnosis model based on this network.

Methods

The dataset of our study contains 992 records, which are uniformly sampled from different departments of the hospital. To integrate the knowledge in these records, an EMR-based medical knowledge network (EMKN) is constructed. This network takes medical entities as nodes, and co-occurrence relationships between pairs of entities as edges. Selected properties of this network are analyzed. To make use of this network, a basic diagnosis model is implemented. Seven hundred records are randomly selected to re-construct the network, and the remaining 292 records are used as test records. The vector space model is applied to represent the relationships between diseases and symptoms. Because a record may contain more than one actual disease, the recall rate of the first ten results and the average precision are adopted as evaluation measures.
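The two evaluation measures named above can be sketched as follows. This is a minimal illustration that assumes the standard information-retrieval definitions; the exact formulas used in the study are not given in this excerpt, and the function names and toy data are hypothetical.

```python
def recall_at_k(ranked, actual, k=10):
    """Fraction of the actual diseases that appear in the first k ranked results."""
    actual = set(actual)
    if not actual:
        return 0.0
    hits = sum(1 for disease in ranked[:k] if disease in actual)
    return hits / len(actual)


def average_precision(ranked, actual):
    """Mean of precision@rank over the ranks at which an actual disease appears.

    Assumes `ranked` contains no duplicates; actual diseases that are never
    retrieved contribute zero.
    """
    actual = set(actual)
    if not actual:
        return 0.0
    hits, total = 0, 0.0
    for rank, disease in enumerate(ranked, start=1):
        if disease in actual:
            hits += 1
            total += hits / rank
    return total / len(actual)


# Toy example with made-up disease names, for illustration only.
ranked = ["pneumonia", "bronchitis", "asthma"]
actual = ["bronchitis", "copd"]
recall = recall_at_k(ranked, actual)      # one of two actual diseases retrieved
ap = average_precision(ranked, actual)    # single hit at rank 2, two actual diseases
```

Both measures tolerate records with several actual diseases, which is the reason the abstract gives for choosing them.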

Results

Compared with a random network of the same size, this network has a similar average path length but a much higher clustering coefficient. Additionally, there are direct correlations between the community structure and the real department classes in the hospital. For the diagnosis model, the vector space model using disease as a base obtains the best result. In 73.27% of the test records, at least one correct disease appears among the first ten results.

Conclusion

We constructed an EMR-based medical knowledge network by extracting the medical entities. This network has the small-world and scale-free properties. Moreover, the community structure showed that entities in the same department have a tendency to be self-aggregated. Based on this network, a diagnosis model was proposed. This model uses only the symptoms as inputs and is not restricted to a specific disease. The experiments conducted demonstrated that EMKN is a simple and universal technique to integrate different medical knowledge from EMRs, and can be used for clinical decision support.

Introduction

The electronic medical record (EMR) stores all health care data and the medical history of a patient in an electronic format [17]. These data include abundant medical knowledge, such as the current clinical diagnosis, medical history, results of investigations, treatment plans, and so on [12]. This is a novel and rich resource for clinical research [26], [44]. As the quantity of EMRs increases rapidly, medical professionals are overwhelmed by this ever-expanding knowledge. Therefore, several effective methods are being developed to assist these professionals [47].

Entities, and relationships between entities, are the primary carriers of medical knowledge in EMRs, and can be extracted by a natural language processing (NLP) technique [44]. Thus, entity recognition [25] and entity relationship extraction become the key tasks in the knowledge extraction of EMRs [29], [47]. Additionally, these tasks are an application for NLP in biomedical informatics [38], and have become the basis of many other tasks [28], [37].

If we regard the extracted entities as nodes, and the relationships between pairs of entities as edges, the knowledge derived from EMRs can be represented as a network. Network-based methods have been extensively used for medical knowledge representation and inference. These networks can be divided into two major groups: Bayesian networks (or, more generally, probabilistic graphical models) and complex networks.

Many existing studies have attempted to diagnose or predict diseases based on Bayesian networks [9], [13], [14], [19], [27], [42]. One of the earliest and most renowned diagnosis systems is Pathfinder, a decision-theoretic expert system for hematopathology diagnosis [19]. Since then, Bayesian networks have played important roles in successive studies. Klann et al. [27] implemented an adaptive recommendation system to recommend the next order on a treatment menu, based on the previous orders. Velikova et al. [42] built a probabilistic disease model for preeclampsia using a temporal Bayesian network and implemented it as part of a real-world home-monitoring system for personalized pregnancy care. Flores et al. [13] presented a methodology for incorporating expert knowledge as structural priors for learning Bayesian networks, and applied it to the study of heart failure. As a probabilistic inference paradigm, Bayesian networks are well suited to modeling complex interactions between medical entities, and in some cases achieve even higher accuracy than clinicians. However, the computational cost of structure and parameter learning for Bayesian networks can become intractable as the number of nodes grows [7]. Networks with hundreds of nodes are difficult to handle, so a given Bayesian network can usually be used only in a single, specific field.

In contrast, complex-network-based models can handle large-scale data, although they are not specialized for knowledge representation and inference. Complex networks refer to large-scale networks having small-world and scale-free properties [2], [45]. Compared with random networks, in which any two nodes are independently connected with a fixed probability [11], [41], complex networks have a small number of nodes with a much larger number of connections. Studies of network systems have pointed out that many real networks are complex networks, rather than random networks [8], [10].
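The comparison with random networks rests on two standard measures, the clustering coefficient and the average shortest-path length. The sketch below computes both for a toy graph using only the standard library. The graph, the function names, and the use of p = mean degree / (n - 1) as the expected clustering coefficient of a same-size random graph are illustrative assumptions, not the procedure used in the study.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_coefficient(adj):
    """Average local clustering coefficient; nodes with degree < 2 count as 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        nbr_list = list(nbrs)
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbr_list[j] in adj[nbr_list[i]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def random_graph_clustering(adj):
    """Expected clustering of a random graph with the same size and mean degree."""
    n = len(adj)
    mean_degree = sum(len(nbrs) for nbrs in adj.values()) / n
    return mean_degree / (n - 1)


# Toy graph: two triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"), ("c", "d")]
adj = build_adjacency(edges)
cc = clustering_coefficient(adj)
apl = average_path_length(adj)
cc_random = random_graph_clustering(adj)
```

On this toy graph the measured clustering coefficient exceeds the random-graph expectation, which is the same qualitative pattern the abstract reports for EMKN.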

In the medical domain, the complex network plays an instrumental role by providing conceptual insights, as well as offering visual and computational methodology [15]. It is widely applied to disease-gene [16], protein-protein [43], and disease-symptom [51] interaction analysis, as well as to the fields of pharmacology [49], [50], epidemiology [20], genetics [30] and brain science [24]. Goh et al. [16] proposed the diseasome, a bipartite graph consisting of disease and gene nodes. A disease and a gene were connected if mutations in that gene were implicated in that disease. Zhou et al. [51] constructed a symptom-based human disease network from biomedical literature databases, and investigated the connection between clinical manifestations of diseases and their underlying molecular interactions. Tachimori et al. [39] constructed a medical network from a medical textbook. This network had small-world and scale-free features, but did not show a knowledge inference ability. To our knowledge, medical networks derived from EMRs have not yet been reported.

We believe such complex networks can support clinical decisions. Applying artificial intelligence (AI) approaches to disease diagnostics is receiving significant attention in the field of medical informatics. Typically, existing studies of AI diagnosis focus on one specific disease. Alizadehsani et al. [1] selected four combinations of related features for coronary artery disease. Then, they applied sequential minimal optimization and other algorithms to these groups of features and reached a best accuracy of 94.08%. Rau et al. [35] proposed a prediction model for developing liver cancer in type 2 diabetes mellitus patients. They identified 10 risk factors as variables, then constructed artificial neural network (ANN) and logistic regression (LR) prediction models. The best results for sensitivity and specificity were 0.757 and 0.755, respectively. Hariharan et al. [18] selected 22 raw features from the voice signals of people diagnosed with Parkinson’s disease. Their diagnosis model consisted of feature pre-processing, feature reduction/selection, and different feature classifiers, using a support vector machine (SVM) and neural network. They obtained a classification accuracy of 100% for the test set.

In these studies, relevant patient features were manually selected, and the diagnostic problem was then converted into a classification task. After feature selection and reduction, classifiers were used to determine whether a patient had the disease. A classifier for a specific disease can achieve results good enough to support clinical decisions. However, this type of study focuses mainly on feature extraction and feature pre-processing. Good features usually yield satisfactory performance, while poor features cannot accurately distinguish positive cases from negative ones, and good feature selection requires professional medical knowledge. Moreover, such a model can only be applied to one disease. If people have certain symptoms and want to know what affliction they might have, these systems are not of much help. We explore the possibility of constructing a more universal diagnosis model that uses simple features (such as symptoms) as inputs and is not restricted to a specific disease.

In this paper, we construct EMKN, a new Medical Knowledge Network based on EMRs. Nodes of this network are medical entities and edges are co-occurrence relationships between entities in the same records. The more frequently an entity pair occurs together, the larger the edge weight between this pair becomes. We compute the main network measures of EMKN and validate its complex-network properties. Next, we propose a basic, universal diagnosis model using EMKN to show that EMKN can indeed express medical knowledge. We apply this model to real EMRs and demonstrate its effectiveness.
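The construction described above can be sketched in a few lines. This is a minimal illustration that assumes each record has already been reduced to a list of entity names and that the edge weight is simply the number of records in which a pair co-occurs; the function name and the toy records are hypothetical, and the actual weighting used for EMKN may differ.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(records):
    """Return a Counter mapping an unordered entity pair to its edge weight.

    Each record is an iterable of entity names. The weight of an edge is the
    number of records in which both entities appear (an assumed weighting).
    """
    weights = Counter()
    for entities in records:
        # Deduplicate within a record and sort so each pair has one canonical key.
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            weights[(a, b)] += 1
    return weights


# Toy records with made-up entities, for illustration only.
records = [
    ["cough", "fever", "pneumonia"],
    ["cough", "fever", "bronchitis"],
    ["fever", "pneumonia"],
]
network = build_cooccurrence_network(records)
```

Storing each pair under a sorted key keeps the graph undirected, so ("fever", "pneumonia") and ("pneumonia", "fever") accumulate into the same edge.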

This paper makes two main contributions:

• We construct EMKN, an EMR-based medical knowledge network, to represent medical knowledge, and we identify the complex properties of this network.
• We propose a simple but universal diagnosis model based on the network. This model takes symptom entities as inputs, and can be used for multi-disease diagnosis.

The remainder of this paper is structured as follows. In Section 2, we give a brief introduction to our EMR corpus, then provide the construction approach and property analysis of EMKN. In Section 3, we describe the details of the diagnosis model. Then, we evaluate our model using actual records in Section 4, and give the results with discussion. Conclusion and future studies will be presented in Section 5.

Section snippets

Medical entities and assertions

Referencing the medical concept annotation guideline and assertion annotation guideline given by Informatics for Integrating Biology and the Bedside (i2b2) [21], [22], we have created guidelines for Chinese EMRs, and manually annotated 992 records as the corpus under the guidance of medical professionals [46]. These records were retrieved from The Second Affiliated Hospital of Harbin Medical University, and contained 887 individual patients. We have obtained the usage rights for these records.

Disease inference

In medical diagnosis, we are typically faced with a set of symptoms and test results. Then, we attempt to find the cause behind them. To make use of the medical knowledge of EMKN, we designed a diagnosis model, which can infer the corresponding diseases with the given symptoms.
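One way to read the vector space model mentioned in the abstract is sketched below: each symptom is represented as a vector over the disease entities, and the vectors of the input symptoms are combined to rank candidate diseases. The normalisation, the summation rule, the `(symptom, disease)` weight dictionary, and all names are assumptions made for illustration; the exact formulation used in the study is not shown in this excerpt.

```python
import math


def symptom_vector(symptom, weights, diseases):
    """L2-normalised vector of co-occurrence weights between a symptom and each disease."""
    raw = [weights.get((symptom, d), 0) for d in diseases]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw


def rank_diseases(symptoms, weights, diseases, top_k=10):
    """Sum the symptom vectors and return up to top_k diseases with a positive score."""
    scores = [0.0] * len(diseases)
    for s in symptoms:
        for i, x in enumerate(symptom_vector(s, weights, diseases)):
            scores[i] += x
    ranked = sorted(zip(diseases, scores), key=lambda pair: pair[1], reverse=True)
    return [d for d, score in ranked[:top_k] if score > 0]


# Toy symptom-disease weights with made-up values, for illustration only.
diseases = ["bronchitis", "pneumonia", "gastritis"]
weights = {
    ("cough", "bronchitis"): 4,
    ("cough", "pneumonia"): 3,
    ("fever", "pneumonia"): 5,
    ("nausea", "gastritis"): 2,
}
result = rank_diseases(["cough", "fever"], weights, diseases)
```

Normalising each symptom vector before summing keeps a very frequent symptom from dominating the ranking, which is a design choice of this sketch rather than something stated in the source.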

We think symptoms are caused by diseases. If we view the symptoms as vectors based on the diseases, we can describe the causality to some degree. Suppose there are N distinct disease entities. Then, the symptom vector

Experiment setup

To evaluate our model, we randomly selected 700 records from our corpus as the training set, and 292 records remained as the test set. We re-constructed the SDEMKN using the new training data. In the process of entity extraction, we added simple rules to unify parts of entities which had the same meanings. For a test record, we took symptoms with positive assertions as input to infer the probable diagnosis result. Diseases in the record with non-absent assertion were regarded as the actual

Conclusion

We constructed EMKN, an EMR-based Medical Knowledge Network, using a manually annotated EMR corpus. This network took medical entities as nodes and the co-occurrence relationships of entities in the same record as edges. We obtained main quantities of EMKN and validated its small-world and scale-free properties. In addition, we illustrated that the community structure of this network was associated with the real department information.

According to the subgraph SDEMKN, we represented the

Acknowledgment

This work is supported by the Natural Science Foundation of China (No. 61672185). We thank the Second Affiliated Hospital of Harbin Medical University for providing the corpus used in this study. We also thank the anonymous reviewers for their comments, which provided us with significant guidance.

References (51)

R. Alizadehsani et al.
A data mining approach for diagnosis of coronary artery disease
Comput. Methods Programs Biomed.
(2013)
J. Cong et al.
Approaching human language with complex networks
Phys. Life Rev.
(2014)
A.C. Constantinou et al.
From complex questionnaire and interviewing data to intelligent Bayesian network models for medical decision support
Artif. Intell. Med.
(2016)
P. Fuster-Parra et al.
Bayesian network modeling: a case study of an epidemiologic system analysis of cardiovascular risk
Comput. Methods Programs Biomed.
(2016)
M. Hariharan et al.
A new hybrid intelligent system for accurate detection of Parkinson’s disease
Comput. Methods Programs Biomed.
(2014)
U. Iqbal et al.
Cancer-disease associations: a visualization and animation through medical big data
Comput. Methods Programs Biomed.
(2016)
S. Keretna et al.
Enhancing medical named entity recognition with an extended segment representation technique
Comput. Methods Programs Biomed.
(2015)
J.G. Klann et al.
Decision support from local data: creating adaptive order menus from past clinician behavior
J. Biomed. Inf.
(2014)
H. López-Fernández et al.
BioAnnote: a software platform for annotating biomedical documents with application in medical learning environments
Comput. Methods Programs Biomed.
(2013)
M. Madkour et al.
Temporal data representation, normalization, extraction, and reasoning: a review from clinical domain
Comput. Methods Programs Biomed.
(2016)

M. Pettorruso et al.
Punding in non-demented Parkinson’s disease patients: relationship with psychiatric and addiction spectrum comorbidity
J. Neurol. Sci.
(2016)
R.A.A. Seoud et al.
TMT-HCC: a tool for text mining the biomedical literature for hepatocellular carcinoma (HCC) biomarkers identification
Comput. Methods Programs Biomed.
(2013)
Y. Tachimori et al.
The networks from medical knowledge and clinical practice have small-world, scale-free, and hierarchical features
Physica A
(2013)
M. Velikova et al.
Exploiting causal functional relationships in Bayesian network modelling for personalised healthcare
Int. J. Approximate Reasoning
(2014)
R.C. Wasserman
Electronic medical records (EMRs), epidemiology, and epistemology: reflections on EMRs and future pediatric clinical research
Acad. Pediatr.
(2011)
W. Zhang et al.
A comparative study of TF*IDF, LSI and multi-words for text classification
Expert Syst. Appl.
(2011)
X. Zhou et al.
Key technology based on network pharmacology of complex networks
2016 IEEE International Conference on Big Data Analysis (ICBDA)
(2016)
A.-L. Barabási et al.
Emergence of scaling in random networks
Science
(1999)
M. Bastian et al.
Gephi: an open source software for exploring and manipulating networks
ICWSM
(2009)
Y. Bengio et al.
A neural probabilistic language model
J. Mach. Learn. Res.
(2003)
V.D. Blondel et al.
Fast unfolding of communities in large networks
J. Stat. Mech.
(2008)
C. Buckley et al.
Evaluating evaluation measure stability
Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
(2000)
D.M. Chickering et al.
Large-sample learning of Bayesian networks is NP-hard
J. Mach. Learn. Res.
(2004)
S.N. Dorogovtsev et al.
Critical phenomena in complex networks
Rev. Mod. Phys.
(2008)
P. Erdős et al.
On random graphs I
Publicationes Mathematicae
(1959)

