1 Introduction

Epidemiology is the study of the distribution and determinants of health-related states or events (including disease), and the application of this study to the control of diseases and other health problems [1,2,3,4,5,6]. Computational and digital epidemiology aims to develop computational models, analytics, and decision support tools to support epidemic science [2, 7]. The science and practice of the discipline have matured steadily over the past decades. Advances in computing and data science have facilitated the development of novel technologies and apps that support epidemic planning and response. These tools are now viewed as important assets for epidemiologists and public health officials. Epidemiologists use such tools to study a number of epidemic policy questions, including forecasting, planning, situational awareness, and intervention analysis; see [4, 8,9,10,11,12,13,14,15,16,17] and the references therein.

1.1 Agent-Based Epidemiological Modeling and Analytics

Aggregate or collective computational epidemiology models, studied in the literature for over a century, often assume that a population is partitioned into a few subpopulations (e.g., by age) with a regular interaction structure within and between subpopulations. Although useful for obtaining analytical expressions for a number of interesting quantities, such as the numbers of sick, infected, and recovered individuals in a population, these models do not capture the complexity of human interactions that serve as the mechanism for disease transmission. See [4,5,6, 18, 19] for more details.

Agent-based models (aka network-based models) extend the aggregate models by representing the underlying interactions as dynamic networks. This class of models captures the interplay between the four components of computational epidemiology: (i) individual behaviors of agents, (ii) unstructured, heterogeneous, multi-scale networks, (iii) the dynamical processes on these networks, and (iv) contextualized, implementable social and pharmaceutical interventions. These models are based on the hypothesis that a better understanding of the characteristics of the underlying network and of individual behavioral adaptation can give better insights into contagion dynamics and response strategies. Although computationally expensive and data intensive, agent-based epidemiology alters the types of questions that can be posed, providing qualitatively different insights into disease dynamics and public health policies. It also allows policy makers to formulate and investigate potentially novel and context-specific interventions. We refer the reader to [1, 2, 7, 11, 12, 18,19,20] for further discussion on this subject.

Agent-based models and analytics have become increasingly popular and are often the model of choice for studying certain kinds of epidemiological questions. They are harder to build, largely because of the expertise in computing and data science needed to do so. For instance, implementing these models on computing clusters and analyzing the outputs they produce requires a fair bit of expertise in computing and data science. This has led to the development of web apps and modeling environments that make it easier for analysts to use such models and analytical tools. Our group was the first to develop such a web-based modeling and analytics environment, called SIBEL (Synthetic Information Based Epidemiological Laboratory); screenshots of the system are shown in Figs. 1 and 2a, b [21]. SIBEL was specifically designed so that epidemiologists can carry out sophisticated what-if studies using agent-based epidemiological models and analytics without becoming computing experts. A few other groups have also begun developing similar modeling and analytics environments; see [22, 23]. Other significant recent related efforts include the following: (i) CDC FluView [24]; (ii) HealthMap [25]; (iii) Texas Pandemic Flu Toolkit [26]; (iv) LANL BARD [27]; and (v) EpiC [28].

Fig. 1

This figure shows the SIBEL web application experiments page. The user sees this page after logging in. This page provides a quick summary of the experiments. If a user clicks the “All” button, then this page shows all the experiments run on SIBEL by various users. Otherwise, it shows only the experiment(s) conducted by the logged in user

Fig. 2

Experiment setup facility in SIBEL

1.2 The Need for a Knowledge Base

The input and output datasets arising in the context of developing agent-based modeling environments for epidemics are large, diverse and heterogeneous. They are stored in various formats and over multiple storage devices. This makes the task of developing effective analytical and decision support tools challenging. This motivates the need to develop a knowledge-base (EpiK) to store, link and efficiently retrieve the diverse datasets. For the purposes of this paper, we use the term knowledge base to denote a data management system to represent, organize, collect, store, and integrate structured and unstructured information in a machine-readable form.

The motivation to develop EpiK arose as we were working on SIBEL. To understand this, we briefly describe how systems like SIBEL are used to support a typical case study.

1.2.1 What Is a Case Study?

An in silico epidemiological case study evaluates outcomes generated by a computational experiment. In these experiments, users can specify the following:

  1. A social contact network

  2. A within-host disease progression model

  3. A set of initial conditions

  4. A set of interventions

Each intervention requires additional details such as the compliance level, the subpopulations to which it applies, and the conditions under which it takes place. An experiment sweeps one or more parameters across user-specified ranges of values. The experiments in a typical study are divided into one or more cells, and a cell may combine multiple interventions; vaccination, social distancing, and school closure are a few examples. In order to arrive at a stochastically robust conclusion, each experiment is run multiple times with different initial conditions (for example, the number of infectious persons at the beginning of the simulation) [21]. A web portal such as SIBEL provides an easy-to-use and intuitive interface for carrying out such an experiment.
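To make this structure concrete, the following minimal Python sketch lays out a hypothetical experiment specification and the number of simulation runs it implies. The field names and values are illustrative assumptions only and do not reflect SIBEL's actual configuration schema.

# Illustrative sketch of an experiment specification of the kind described
# above. All field names and values are hypothetical, not SIBEL's schema.
experiment = {
    "name": "seattle-ili-planning",
    "contact_network": "seattle_synthetic_v1",            # social contact network
    "disease_model": {"infectivity": 0.0001,              # within-host progression
                      "days_to_recovery": 5},
    "initial_infections": [5, 10, 20],                    # swept initial condition
    "cells": [                                            # one cell per intervention mix
        {"interventions": []},                            # baseline
        {"interventions": ["vaccination"]},
        {"interventions": ["school_closure", "social_distancing"]},
    ],
    "replicates_per_cell": 10,                            # repeated runs for robustness
}

# Total simulation runs implied by this design:
runs = (len(experiment["cells"])
        * len(experiment["initial_infections"])
        * experiment["replicates_per_cell"])
print(runs)  # 3 cells x 3 initial conditions x 10 replicates = 90 runs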

1.2.2 An Illustrative Example

A typical case study involves planning for an infectious disease outbreak such as influenza-like illness (ILI). Consider the following hypothetical situation: Alice and Bob are two epidemiologists tasked with carrying out computer experiments to study various what-if scenarios concerning the spread of ILI in the USA. They are particularly interested in understanding the role of interventions in controlling the outbreak. Both Alice and Bob decide to use SIBEL for their computer experiments. As a first step, Alice selects Boston, MA, as a region of interest since reports suggest that it might be one of the first regions to experience an ILI outbreak this season. Bob chooses Houston, TX, as a region of interest, keeping in mind that the last influenza pandemic started in Mexico. Using SIBEL, both select different disease parameter values, including susceptibility, infectivity, and days to recovery. They then choose a complex set of interventions, including school closure, social distancing, and antiviral distributions, and decide how these interventions are implemented (who, when, and for how long). The entire workflow effectively sets up a formal statistical experimental design. Launching the experiments is easy with SIBEL; it involves simply pushing the run button. Alice and Bob do not have to know where the data is stored and where the computation will be carried out. Once the experiments are completed, they can carry out a detailed analysis. Basic results of the experiments are displayed to the user through a set of plots and aggregate statistics.

1.2.3 A More Detailed Epidemic Investigation and Initial Challenges for SIBEL

Suppose, upon completion of the study, Alice seeks to infer additional information about a few infected individuals, including their home locations and their daily activities before the time of infection. Alice might also look for infected individuals who are known as super-spreaders, while Bob might be interested in the role of critical workers and school-aged children in disease transmission. Both might also want to do a comparative analysis to see the differences and similarities between the epidemics in the two regions. This would help them understand the impact of social networks and the built infrastructure on epidemic dynamics and, in turn, may help them design and study contextualized intervention policies. Investigations such as these can be cast as workflows involving queries that Alice and Bob write.

To retrieve the results for the abovementioned queries, a user with access to the servers must first determine the cell information of the experiment from the SIBEL user interface. The system administrator must then read the cell directory on the HPC resource; the cell directory contains the configuration file, which gives the location of the output file on the HPC resource. The output file provides infector and infectee information along with their identification (ID) numbers. The configuration file also contains the path of the social network folder, from which the location of the synthetic population, stored in a relational database, can be determined. By querying that database with the IDs of infected individuals, Alice or Bob can discover the activities performed by an infected person in the last few days, the disease transmission between various demographic groups, and their location information. In its current form, SIBEL does not support this kind of enhanced analysis. In other words, answering such queries tends to be a manual, tedious, and laborious process because of the heterogeneous and fragmented data sources involved.
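The sketch below illustrates the kind of manual chaining this entails; every path, file format, and table or column name is a hypothetical stand-in, intended only to show how many separate data sources must be touched to answer one question.

# Illustrative sketch of the manual retrieval chain described above.
# All paths, file formats, and table/column names are hypothetical.
import configparser
import sqlite3  # stand-in for the actual relational database

# 1. Read the cell's configuration file from the HPC resource.
config = configparser.ConfigParser()
config.read("/hpc/experiments/exp_42/cell_3/config.ini")
output_file = config["output"]["dendogram_path"]            # who-infected-whom output
population_db = config["input"]["synthetic_population_db"]  # synthetic population location

# 2. Parse the output file to collect the IDs of infected individuals.
infected_ids = set()
with open(output_file) as f:
    for line in f:
        infector_id, infectee_id, day = line.split()
        infected_ids.add(infectee_id)

# 3. Query the synthetic-population database with those IDs to recover
#    home locations and recent activities.
db = sqlite3.connect(population_db)
for pid in infected_ids:
    rows = db.execute(
        "SELECT home_location, activity_type, start_time "
        "FROM person JOIN activity USING (pid) WHERE pid = ?", (pid,))
    for row in rows:
        print(pid, row)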

1.2.4 Further Challenges as SIBEL Is Used and Grows

A key aspect of SIBEL is its simplicity—public health analysts are trained to make effective use of the system in approximately three hours [21]. Studies that once took days or months can now be done in hours using SIBEL. The early days of SIBEL saw significant effort devoted toward making the system user-friendly and toward modeling realistic epidemic situations by scaling the backend simulation engine [29,30,31]. As a result of this effort, SIBEL is used by epidemiologists, computational researchers, and public health experts to model the spread of infectious diseases over large geographical regions with populations of up to 50 million people. We have since performed over 30 case studies requested by our sponsors in many metropolitan areas in the United States. As a recent example, SIBEL was used to support DTRA’s Ebola response efforts in West Africa. SIBEL is constantly evolving: new modeling engines are being added to the backend, new kinds of analysis are supported, and new populations and regions are being added.

The continued growth and use of SIBEL have made the problems identified above even more challenging. The output datasets are growing rapidly as analysts use SIBEL routinely. As a result, comparative analysis of experiments and contextual analysis of outcomes, which need to reason over both input and output data, have become progressively more challenging. Data-centric tasks like validation, verification, analysis, and model refinement have also become cumbersome. This problem is further exacerbated because the exponential growth in the volume of data is intertwined with heterogeneity (in data types) and fragmentation (in storage) of the datasets. Heterogeneity in data types is pervasive across the entire spectrum of the discipline, i.e., from data acquisition (surveillance, census, etc.) to analysis of simulation outputs. A typical simulation makes use of a sequence of digital content types including input parameter configurations, raw datasets, result summaries, analyses and plots, documentation, publications, and annotations. A myriad of datasets is generated as simulation output. SIBEL is distributed and part of a larger ecosystem. Fragmentation of storage is a major hurdle, arising primarily from the logistics of maintaining such large datasets and from the architecture of the SIBEL infrastructure itself. All these factors contribute to the current state where organization, querying, annotation, and other services become cumbersome and often infeasible. Small changes to the schema at the storage layer cascade into significant changes in the data access and application layers. EpiK addresses these challenges.

Although the discussion was framed in the context of SIBEL, the issues discussed are generic — all epidemic modeling environments for decision support and planning face similar challenges.

1.3 Contributions

The paper describes EpiK—a knowledge base that supports agent-based epidemiological modeling, analytics, and decision-making. EpiK is designed to store, link, and query the diverse datasets arising in epidemic modeling and counterfactual analysis. We focus on agent-based networked models; supporting compartmental models is considerably easier and can be seen as a special case. EpiK uses modern semantic web technology to unify and link diverse dynamic datasets and represent them in a machine-understandable format. It provides methods for building semantic graphs and provides a query bank to improve analysts’ productivity.

The paper focuses on agent-based epidemiological modeling and analytics of infectious diseases. Extensions to chronic and environmental disease epidemiology will be undertaken subsequently. The use of EpiK to support aggregate modeling is relatively easy.

In Fig. 3, we present the main flow of agent-based epidemic modeling and analytics. EpiK provides linked access to SIBEL data sources such as synthetic population and contact data, computational experiment setup data, and computational experiment output data. The datasets may contain data of different types (such as demographic, aggregate, and sequence data), may differ in storage technology (such as ASCII or binary files), and may be physically distributed across different machines in the network. We discuss two approaches for machine-understandable RDF representations and discuss the use of virtual and materialized views when developing EpiK.

Fig. 3

This figure displays the computational networked epidemiology pipeline (CNEP). It has three parts. The first part of the pipeline is “Synthetic Population.” This part handles a synthetic representation of people, household structure, locations, and various activities of people. We employ census, LandScan, and other data sources to create a statistically identical synthetic population. This part also creates synthetic people-visitation graphs (people-people and people-location). The second part of the figure is “Agent-Based Modeling,” which represents the SIBEL tool. It employs the “People Visitation Synthetic Graphs” generated in the earlier phase and also uses “Disease Model,” “Initial Condition,” “Intervention,” and “Trigger” information. This part conducts various computational networked epidemiology experiments and generates numerous output files. The main file is the “Dendogram,” which contains who-infected-whom information. Finally, by using the SIBEL analysis facility, the user can perform data analytics on experiment outputs. The EpiCurve and daily mean infections are some examples of the analysis output

A programmatic data access capability is provided to access the data organized within EpiK. This permits data sharing and interoperability and makes it easier for others to build applications on top of the simulation data. The user can execute complex queries over heterogeneous SIBEL contents. We demonstrate the utility of EpiK by developing a query bank. The queries are representative of typical analyses carried out by an epidemiologist during the planning or response phase of an epidemic. These queries are based on the 5W1H concept and demonstrate the utility of semantic web technologies. Specific contributions are summarized below.

1. Linked Data Access and Query Execution Framework

A central contribution of EpiK is a federation of all datasets relevant to agent-based epidemiological modeling and analytics using SIBEL—including input, output, and experimental conditions. The proposed framework offers several advantages: (i) linking large amounts of data, (ii) linking a variety of data sources, (iii) programmatic data access, and (iv) query efficiency. The methodology is general—as new datasets arrive, they can be easily integrated into the proposed EpiK framework. Data linking in EpiK is achieved through native relational database (RDB) to RDF data mappings. The RDF tools and data mappings are central to our research. This linked data framework, to a large extent, addresses the challenges posed by heterogeneity in schema, formats, and storage across the datasets in the domain science. The key concept to achieve this end is to define a representation of the datasets that encapsulates both the location and the schema heterogeneity. The proposed framework provides a linked view to access and query end-to-end epidemiology workflow datasets.

2. Native to RDF Mappings (Techniques and Their Trade-Offs)

We study two classes of RDF mappings in this context: value-based and tuple-based. Fundamentally, the difference between these two mappings lies in how relationships among entities are expressed and materialized. Value-based mapping preserves parent-child entity relationships, while tuple-based mapping ignores them. This paper investigates the trade-offs of the mappings with respect to schema clarity (including ease of understanding the data organization and the ability to frame queries), storage cost, and efficiency of answering queries.

3. Query Bank and Benchmarking

We develop a benchmark to evaluate different implementations for a homogeneous query execution framework. The homogeneous query execution framework builds a federation of the disparate datasets and exposes them (using the mappings) as a unified whole. The benchmark is a collection of SPARQL queries over the federated SIBEL data. Collectively, they capture access patterns and workloads that appear frequently in the domain. The queries are representative of real-time epidemiology queries. We use the benchmark to evaluate two kinds of query execution frameworks: one relational engine-based and the second RDF engine-based. Relational engines are space efficient and therefore can scale to very large datasets while RDF engines are query efficient and therefore useful when a real-time response is mandatory.

4. Evaluation of the Design Space for Homogeneous Data Access and Querying Framework

The choice of representation style (value- vs. tuple-based) combined with the implementation of the execution platform (relational engine-based vs. RDF engine-based) provides us with four possible design points for the proposed platform. Each design point has different trade-offs. This work evaluates all four possible implementations with respect to the benchmark.

In summary, we provide a solution for how to build a linked database for heterogeneous and federated computational epidemiology data sources. Preliminary evaluation of the framework demonstrates that leveraging semantic web technology (in particular RDF concepts) is an effective strategy for handling data-centric challenges faced in such distributed, multi-user, large-scale computational science. Our approach demonstrates that fast unified access across fragmented datasets is feasible. Different strategies offer trade-offs with respect to storage space or query time. Query execution in a pure RDF engine is faster, but has a large storage cost. On the other hand, in the hybrid relational-RDF engine, we have zero RDF graph storage cost but efficiency is compromised.

2 Related Work

Our work builds on earlier work in scientific data management, digital libraries, and semantic web tools. Scientific data management is a long-standing research field, driven by the fact that most scientific software produces a myriad of specialized data. Early work on scientific data management introduced the concept of a process-oriented scientific database model (POSDBM) [32], which includes two data objects and relation types. This work was extended with key semantic elements of scientific experimentation by Pratt et al. [33].

Shi et al. [34] present a web-based epidemiology reporting system that uses Google Maps data. Allon et al. [35] describe a leukemia epidemiology study in which the data model arises from the experiment based on a relational modeling approach. In the Semantic Web domain, tools that map data from the relational model to RDF (and vice versa) have been an active area of research [36,37,38,39,40]. Bertails et al. demonstrate the power of the mapping to handle all of the important features of SQL tables, such as cardinality and NULLs, and to yield an RDF graph that preserves the relational information [37]. Hert et al. illustrate ontology-based access to relational databases as discussed in [41]; they describe a mapping language, the translation algorithms, and a prototype implementation. Robert et al. develop a schema ontology for healthcare using the reference information model (RIM). This modeling helps better manage Healthcare Effectiveness Data and Information Set (HEDIS) measures by providing a rule ontology matched to the language of the specification [42]. Horrocks et al. propose an ontology-based data access (OBDA) solution for managing diverse varieties of data. They describe their efforts in the context of managing disparate Siemens Energy Services datasets; see [43] for further discussion.

Multiple studies have addressed processing queries using mappings [44,45,46]. Bornea et al. [44] describe novel query translation techniques as well; experiments show that their approach compares well with current state-of-the-art stores. Groppe et al. describe a SPARQL query optimization technique that uses seven indices to retrieve RDF data quickly [45]. This approach computes joins by dynamically restricting triple patterns and provides good efficiency. Arenas et al. introduce a theoretical foundation for faceted search explicitly designed for RDF-based knowledge graphs enhanced with OWL 2 ontologies; the authors also investigate convenient faceted interfaces [47].

Unification of heterogeneous data using semantic tools has been actively studied for biological data [48, 49]. BioPortal provides biomedical ontologies [50]. Federation (or mediation) of physically distributed data sources using RDF (and other semantic constructs) in DartGrid is presented in [51], which uses RDF/OWL to define the mediated ontologies for integration, together with automatic conversion rules from relational schemas to RDF/OWL descriptions. Kamdar et al. propose an approach for Ebola virus knowledge-base creation using semantic web technologies [52].

Examples of scientific digital libraries include earthquake simulation repositories [53], embedded sensor network DLs [54], and D4Science II [55]. Barrett et al. describe a data management tool to study infectious diseases [56]. Schriml et al. provide the GeMIna system, which uses epidemiology metadata to identify infectious pathogens and their representative genomic sequences [57]. The VecNet digital library maintains curated data, tagged citations, and articles related to epidemiology [58]. Leidig describes a scientific digital library to manage epidemiology experiments and simulations, but it does not provide a specific framework for heterogeneous epidemiology big data management [59].

3 The EpiK: Data Federation and Homogeneous Query Execution Framework

The proposed EpiK consists of three layers: (i) the data layer (Section 3.1), (ii) the mapping layer (Section 3.2), and (iii) the RDF engine and services layer (Section 3.3). See Fig. 4. We discuss each of the layers below.

Fig. 4

The proposed EpiK for agent-based epidemic modeling and analytics. It consists of three components/layers. The bottom layer (data layer) contains the various computational epidemiology datasets in numerous formats. The middle layer (mapping layer) converts all the relevant data into materialized and virtual RDF graphs through mapping files and the D2RQ engine. The top layer (RDF engine and services layer) exposes the materialized RDF graph through Jena TDB [60] and Virtuoso servers [61], and the virtual RDF graph through the D2RQ server [62]

3.1 The Data Layer

We start by describing EpiK’s data layer; see Fig. 4. Here, we focus on the datasets used and produced by SIBEL. The datasets are summarized in Fig. 5 and comprise (i) the synthetic population, synthetic people-location network, and synthetic social contact network, (ii) experimental setup data consisting of disease model parameters and intervention specifications, and (iii) experimental output data comprising epidemic curves, dendograms, and various analyses. Table 1 shows the type, size, and format of the datasets present in SIBEL [63].

Fig. 5

Figure summarizing the types of datasets used as part of the agent-based epidemic modeling and analytics pipeline shown in Fig. 3. Ovals represent datasets, and the ontological relationships between datasets are labeled next to the ovals. The synthetic population and network part covers the person, household, location, activity, and contact network datasets. The agent-based modeling part uses the contact network, disease model, and intervention datasets and produces output datasets. The analytical tool covers datasets related to experiment output analysis

Table 1 This table presents approximate sizes of the various computational networked epidemiology datasets available at NDSSL

3.1.1 Synthetic Input Data

Synthetic population and activity data describe individuals and their activities. Synthetic activity data describe the activities performed by individuals, including travel and visits to locations of work, home, and daily errands. As part of these activities, people come in close contact with other individuals. The contact data capture this information and are central to the study of epidemics. The synthetic population database contains household, person, activity, and location information. Data are stored in a relational format. Schema descriptions of the synthetic population tables for household, person, activity, activity location, and home location are given in Tables 2, 20, 21, 22, and 23. Descriptions of the U.S. Census 2000 Public Use Microdata Sample (PUMS) demographic variables are available online at [64].

Table 2 This table provides the name and description of the household-related attributes stored in the relational database

3.1.2 Disease Manifestation and Intervention Data

Experiments are divided into multiple cells. Each cell characterizes one set of quantified simulation conditions and may have many replicates (10 is typical). Various intervention actions can be applied to an epidemic simulation to support decision-making; social distancing, school closure, work closure, vaccination, and antivirals are some of the intervention types. Interventions may be implemented at a predetermined time (e.g., on day 1) or when a certain condition is met (e.g., when 1% of school-aged children are diagnosed). The goal of an intervention is to interrupt disease transmission from one person to another. Interventions may be targeted to specific demographic groups, or subpopulations, such as school-aged children or workers in critical jobs (e.g., first responders). SIBEL experiment setup data are stored in both the relational database and the file system. The relational database mainly stores experiment, analysis, disease model, and intervention information (Tables 24, 25, 26, and 3) [29]. Configuration information is stored in the file system (Table 4).

Table 3 This table provides the name and description of the intervention-related attributes stored in the relational database
Table 4 This table provides the name and description of the Experiment Configuration and Output file related attributes stored in the relational database

3.1.3 Output Data

Output data describe the spread of disease through a population. They generally contain three parts [65]. First, they contain a list of infected people along with information about the time and duration of incubation and infection, and the results of interactions with health care providers, such as whether or not they were diagnosed. Second, they contain a set of trees, with the initially infected persons as the roots, showing the path of infection through the population; this is referred to as a dendogram. Third, they contain a list of interventions and the times at which they were implemented. Multiple dendograms across replicates and cells are used to produce tabular data (e.g., epidemic curves, intervention rankings, and interaction effects).

The datasets are distributed across several different databases, schemas, file systems, and machines. The scale of the data and the use of fragmented storage make tracking experiment setups and scientific workflows cumbersome and error prone, and therefore limit verification and reproducibility of results (Table 4).

3.2 The Mapping Layer

Next, we discuss the second layer of EpiK as depicted in Fig. 4. This layer maps the heterogeneous datasets of the data layer into RDF.

3.2.1 Why RDF?

Our choice to represent all data using RDF stems from similar considerations by other researchers using RDF in the field of biomedical and health sciences. We briefly discuss the rationale below.

First, epidemiologically relevant datasets come in various forms, are large, and are growing rapidly as SIBEL continues to be used by policy analysts. Often, new kinds of datasets need to be accommodated. This requires frequent schema changes when using a relational database management system (RDBMS), but is easily handled with RDF. Commercial relational databases can be useful in some cases, but their cost and licensing rules limit their use and increase the cost of using SIBEL for our customers.

Second, the RDF data model is a good choice for epidemiologists and satisfies the prerequisites listed earlier. It is a standard data model recognized by the World Wide Web Consortium (W3C) [66]. RDF permits the consolidation of various data models (for example, tree, relational, and graph) and vocabularies [67]. The RDF data model provides a dynamic metadata structure, in contrast to relational databases. Therefore, RDF is much more suitable for handling growing, graph-oriented, dynamic data [67].

Third, RDF-based storage allows us to identify data on the internet with URIs. This allows the creation of globally unique names and is useful because multiple stakeholders across the world use SIBEL with little direct coordination. Using RDF makes it possible to distribute our data as linked open data (LOD) on the internet. This makes it easy to link our datasets to pertinent internal and external datasets from the LOD cloud (for example, BioModels, BioPortal, BioGateway, and BioSamples) [68, 69]. Linking these and other social media datasets will be needed in the near future to understand complex questions arising in antimicrobial resistance and phylodynamic analysis.

Fourth, SPARQL is a natural language for studying questions arising in epidemic modeling, given that a number of questions pertain to traversing social contact networks or disease dendograms. Complex queries that require combining many tables in SQL are relatively easy to create in SPARQL [70]. Moreover, transitive closure style queries (e.g., looking for paths in a network) are easier to express in SPARQL than in SQL.
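As an illustration of this last point, the sketch below runs a transitive-closure style query over a toy who-infected-whom fragment using a SPARQL 1.1 property path. The ex:infected predicate and the data are illustrative assumptions, not the EpiK vocabulary, and the rdflib library is used only to keep the example self-contained.

# Sketch: a transitive-closure ("chain of infection") query expressed with a
# SPARQL property path. Vocabulary and data are illustrative only.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/epik/")
g = Graph()

# Toy dendogram fragment: p1 -> p2 -> p3, and p1 -> p4.
for infector, infectee in [("p1", "p2"), ("p2", "p3"), ("p1", "p4")]:
    g.add((EX[infector], EX.infected, EX[infectee]))

# "Who lies anywhere downstream of p1 in the transmission chain?"
# The '+' path walks one or more ex:infected edges; the equivalent SQL
# query would require a recursive join.
q = """
PREFIX ex: <http://example.org/epik/>
SELECT ?descendant WHERE { ex:p1 ex:infected+ ?descendant }
"""
for row in g.query(q):
    print(row.descendant)  # p2, p3, and p4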

Finally, using RDF allows the use of publicly accessible linked data browsers, link discovery, semantic web indexes, and diverse representation tools, with very little implementation effort. Rather than creating a custom graphical user interface (GUI) for data browsing, epidemic experts can use existing faceted browsers to analyze the data with the help of basic ontologies [71,72,73,74]. Any change in the data or ontology is automatically reflected in the GUI. Most of these tools are freely available on the web. Although they have some limitations, they provide a quick view on top of the data at no cost.

3.2.2 Mapping to RDF

We would like to unify the heterogeneous and fragmented CNEP datasets into RDF format. That requires relational-to-RDF mappings. Several mapping strategies exist in the literature, along with tools that use the mappings to build the RDF dataset [36,37,38,39,40]. In this work, we investigate two types of mapping, tuple-based and value-based [75], and use the D2RQ tool to generate the mappings.

3.2.3 Tuple-Based and Value-based Mappings

In tuple-based mapping, the RDB table name becomes the RDF class name and each column name becomes a property name. For each row (tuple) of the table, a unique blank node is created; blank nodes can be named in various ways, with incremental number assignment being the simplest. Each generated triple consists of the blank node, a labeled edge, and a literal: the property name serves as the edge label and the column value as the literal (Fig. 6). Tuple-based mapping takes relational database tables as input and produces an RDF graph as output, ignoring primary key-foreign key relationships.

Fig. 6

This figure depicts a tuple-based mapping example. Here, we have two datasets (person and household). The tuple-based mapping disregards the primary key and foreign key relationship in RDF mapping

The primary key-foreign key joins that exist in the database are preserved in value-based mapping. The input of value-based mapping is relational tables, and the output is an RDF graph with join information. Primary key attribute values are represented by URIs. In value-based mapping, the foreign key value of a child object points to the parent object identified by its primary key value (Fig. 7).

Fig. 7

This figure illustrates a value-based mapping example. The value-based mapping considers the primary key and foreign key relationship in RDF mapping (dotted arrow in the figure). Here, “HID” is the primary key in the household table and foreign key in the person table

In this paper, we have used tuple-based and value-based mappings. Our choice was based on the following observations. First, both are commonly used mapping schemes in the literature; see [75]. Numerous relational-to-RDF data conversion approaches have been developed on top of tuple-based mapping or modified versions of it; examples include Relational.OWL [76], DataMaster [77], ROSEX [78], Automapper [79], FDR2 [80], CROSS [81], D2RQ [82], Tether [83], and OntoAccess [41]. Second, value-based mapping is the simplest relational-to-RDF mapping approach that preserves primary key and foreign key relationships [75]. The availability of mature tools and our goal of preserving primary and foreign key relationships were important considerations in making this choice. To the best of our knowledge, this is the first use of RDF graphs for supporting computational networked epidemiological investigations.

Lausen discusses other kinds of mappings for converting an RDB into RDF, most notably URI-based and object-based mappings [75]. However, these have certain limitations. In URI-based mapping, primary key values are encoded in the tuple URI; if the primary key of a table is also a foreign key to another table, consistency problems may arise. In object-based mapping, it is hard to follow foreign keys because doing so requires property path traversal [84].

Tuple-based and value-based mappings are formally defined below.

Preliminaries

Let \(S\) be a relational schema and \(A(S) = \{A_{1}, \ldots, A_{k}\}\) the attributes of \(S\). \(I\) denotes an instance of \(S\), and \(T = (a_{1}, \ldots, a_{k})\) represents a tuple of \(I\). We use \(C_{S}\) as a class over \(S\) with properties \(P(C_{S}) = \{P_{S,A_{1}}, \ldots, P_{S,A_{k}}\}\), and \(F_{C}\) to denote the finite set of classes. A mapping creates an RDF graph \(G\) such that for every tuple \(T\) in \(I\) a node \(n_{T}\) and an edge \((n_{T}, rdf{:}type, C_{S})\) are created. Tuple-based and value-based mappings differ in how the remaining edges are created.

Tuple-Based Mapping

Let \(T.A\) be the value of an attribute \(A\) in tuple \(T\). For every non-null value \(T.A\) of \(T\), \(A \in A(S)\), tuple-based mapping creates an edge \((n_{T}, P_{S,A}, (T.A)_{n_{T},P_{S,A}})\). This type of mapping ignores primary key and foreign key relationships.

Value-Based Mapping

We use \(C_{S}\) to denote a child class and \(C^{\prime}_{S^{\prime}}\) a parent class, with \(C_{S}, C^{\prime}_{S^{\prime}} \in F_{C}\). We write \(T.A\) for a non-null value of an attribute \(A\) of \(C_{S}\) and \(T^{\prime}.A^{\prime}\) for a non-null value of an attribute \(A^{\prime}\) of \(C^{\prime}_{S^{\prime}}\); \(P_{S,A}\) denotes the properties of \(C_{S}\) and \(P^{\prime}_{S^{\prime},A^{\prime}}\) the properties of \(C^{\prime}_{S^{\prime}}\). Value-based mapping preserves primary key and foreign key relationship information: if \(T.A\) of \(T\) holds a foreign key value and \(T^{\prime}.A^{\prime}\) of \(T^{\prime}\) holds the corresponding primary key value, then value-based mapping creates an edge \((n_{T}, P_{S,A}, (T^{\prime}.A^{\prime})_{n^{\prime}_{T^{\prime}},P^{\prime}_{S^{\prime},A^{\prime}}})\) to represent the relationship. For every other non-null value \(T.A\) of \(T\), it introduces an edge \((n_{T}, P_{S,A}, (T.A)_{n_{T},P_{S,A}})\).
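To make the two definitions concrete, the following Python sketch builds both kinds of RDF graphs for a toy household (parent) and person (child) table pair, mirroring Figs. 6 and 7. It uses rdflib and a made-up namespace; it illustrates the definitions above and is not the D2RQ implementation used in EpiK.

# Minimal sketch of tuple-based vs. value-based mapping for a household
# (parent) and person (child) table. Illustrative only; not the D2RQ tool.
from rdflib import BNode, Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/epik/")

# Toy relational instances; hid is the primary key of household and a
# foreign key in person.
households = [{"hid": "H1", "zipcode": "98101"}]
persons = [{"pid": "P1", "hid": "H1", "age": 9}]

def tuple_based(graph, rows, class_name):
    # Every tuple becomes a blank node; every value, including keys, is a literal.
    for row in rows:
        node = BNode()
        graph.add((node, RDF.type, EX[class_name]))
        for col, val in row.items():
            graph.add((node, EX[col], Literal(val)))  # no link to the parent table

def value_based(graph, rows, class_name, key_col, fk_cols=None):
    # Primary key values become URIs; foreign key columns point at the parent URI.
    fk_cols = fk_cols or {}
    for row in rows:
        node = EX[f"{class_name}/{row[key_col]}"]
        graph.add((node, RDF.type, EX[class_name]))
        for col, val in row.items():
            if col in fk_cols:
                graph.add((node, EX[col], EX[f"{fk_cols[col]}/{val}"]))  # preserved join
            else:
                graph.add((node, EX[col], Literal(val)))

g_tuple, g_value = Graph(), Graph()
tuple_based(g_tuple, households, "household")
tuple_based(g_tuple, persons, "person")
value_based(g_value, households, "household", "hid")
value_based(g_value, persons, "person", "pid", fk_cols={"hid": "household"})
# In g_value the person node links to its household node, so a SPARQL join
# across the two "tables" follows an edge instead of comparing literal values.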

3.2.4 Mapping Implementation

We use the D2RQ mapping language to convert relational data to RDF graphs. It is a declarative language; mapping files are generated from the table structure of a database, and a mapping file is then used to generate an RDF graph through resource identification and property value generation techniques. D2RQ maps database schemas to RDFS/OWL schemas. A D2RQ ClassMap maps database records to RDF classes of resources, and resources are identified using URI patterns. For example, uriPattern:seattle/person/@@person.pid@@ creates a URI such as seattle/person/215016716. A ClassMap has a number of PropertyBridges that specify how resource descriptions are created. As a property value, D2RQ supports literals, URIs, or blank nodes, which can be created directly from database values or by using patterns. The D2RQ join facility enables primary key-foreign key joins. The propertyBridge definition in Table 6 creates the property person:hid and also shows the D2RQ join facility.

In Table 5, we present the RDF triples corresponding to Figs. 6 and 7. To demonstrate the difference between the two types of mapping, we also present the results of a sample query. The results show that, unlike tuple-based mapping, value-based mapping captures data link information because it stores primary and foreign key relationships (Table 5).

Table 5 This table provides an example of the RDF triples generated by tuple-based and value-based mappings, a sample query, its SPARQL implementations, and results
Table 6 In this table, we present a partial D2RQ mapping implementation of the value-based mapping (join part only)

3.3 The RDF Engine and Services Layer

We now discuss the third layer of EpiK (Fig. 4), which comprises the RDF engine, the query language, and various services. Querying the RDF generated by the mappings requires an RDF data engine and a querying interface (a web frontend for the query service). We evaluate Virtuoso and Jena TDB as the RDF data engine. Virtuoso is an open source framework for developing semantic web and linked data applications, and Jena is a useful tool for processing RDF data. The querying interface is provided through the Fuseki server and the Virtuoso server; both provide SPARQL endpoints.
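Because both servers expose standard SPARQL endpoints over HTTP, the data organized in EpiK can also be accessed programmatically. The sketch below shows what such access might look like from Python using the SPARQLWrapper library; the endpoint URL and the ex: vocabulary are placeholders rather than the deployed EpiK endpoint or schema.

# Sketch of programmatic access to an EpiK SPARQL endpoint (Fuseki or
# Virtuoso). The endpoint URL and vocabulary are placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://localhost:3030/epik/sparql")  # hypothetical Fuseki endpoint
endpoint.setQuery("""
    PREFIX ex: <http://example.org/epik/>
    SELECT ?person ?age WHERE {
        ?person ex:has_age ?age .
        FILTER (?age >= 3 && ?age <= 18)   # school-aged individuals
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()

for binding in results["results"]["bindings"]:
    print(binding["person"]["value"], binding["age"]["value"])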

In addition to querying, the framework allows faceted browsing using the Virtuoso Facet Browser. This capability offers browsing over billions of triples, full-text search, structured querying, and result ranking. It allows epidemiologists to navigate through the complete epidemiology workflow, exploring experiment, synthetic population, disease model, intervention, and analysis information.

4 Query Bank

We study two classes of queries, summarized in Tables 9, 10, and 11. The queries in Table 9 were created from our discussions with epidemiologists, while the queries in Table 10 were created by following the BSBM SPARQL benchmark queries. The queries in Table 11 are standard D2RQ benchmark queries.

4.1 Native Queries Based on 5W1H Approach

We employ a 5W1H interrogative approach to classify the queries [85,86,87]. This approach draws on rhetorical theory and structuration and better captures the social and epidemiological context; see [88]. The 5W1H approach classifies a query along six dimensions, namely: (i) WHO (human/participants); (ii) WHAT (object/content); (iii) WHEN (time); (iv) WHERE (space/location); (v) WHY (purpose/behavior); and (vi) HOW (behavior/form). See [87,88,89,90] for additional discussion of structuring queries and knowledge bases using the 5W1H approach.

In Table 7, we present a mapping between 5W1H questions and RDF property vocabularies. The table shows that it is possible to construct a large corpus of queries using the 5W1H approach. The first column lists the 5W1H question types, the second column provides a description of each question type, and the last column shows an example list of RDF data property vocabulary terms. These terms can be used in the predicate part of RDF triples. The wildcard (*) in the vocabulary means that syntax variations are possible, for example, “has_age” and “has_state”. We provide an example SPARQL query in Table 8.

Table 7 This table illustrates how 5W1H questions can be mapped to the RDF data property vocabularies
Table 8 In this table, we present a SPARQL implementation of a sample 5W1H query
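To give a feel for how the dimensions combine, the sketch below expresses a query mixing WHO, WHERE, and WHEN restrictions, written as a SPARQL string of the kind a Python client would submit to the endpoint shown earlier. The has_* predicates follow the naming convention of Table 7 but are assumptions rather than the exact EpiK vocabulary.

# Illustrative 5W1H query: WHO (school-aged children), WHERE (home location),
# WHEN (infected during the first week). Predicate names are assumptions.
query_5w1h = """
PREFIX ex: <http://example.org/epik/>
SELECT ?person ?homeLocation ?day WHERE {
    ?person    ex:has_age      ?age ;           # WHO: demographic restriction
               ex:has_location ?homeLocation .  # WHERE: home location
    ?infection ex:has_infectee ?person ;
               ex:has_day      ?day .           # WHEN: day of infection
    FILTER (?age >= 3 && ?age <= 18 && ?day <= 7)
}
"""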

The query bank was created after a number of discussions with epidemiologists who have used SIBEL and other similar tools over the years. Table 9 summarizes the queries generated through this effort. The queries were chosen to highlight the need to synthesize varied datasets. For example, answering a query such as “What is the household location and the daily activity of an infected individual?” requires access to the Experiment, Dendogram, Person, Household, Location, and Activity datasets. Some parts of the experimental data exist in a relational database, and others are stored in files. As discussed earlier, we map all the datasets into RDF; SPARQL can then be used to answer the query.

Table 9 In this table, we present queries we collected by interviewing various epidemiologists (native queries)
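A sketch of how this particular native query might be expressed once the datasets are linked in RDF is shown below; the class and property names are illustrative placeholders for the EpiK vocabulary, not the exact terms used in Table 9.

# Sketch: "What is the household location and the daily activity of an
# infected individual?" joining dendogram, person, household, location,
# and activity data. Property names are illustrative placeholders.
query_household_activity = """
PREFIX ex: <http://example.org/epik/>
SELECT ?person ?homeLocation ?activityType ?startTime WHERE {
    # Dendogram: infected individuals in a given experiment
    ?infection ex:has_experiment ?experiment ;
               ex:has_infectee   ?person .
    # Synthetic population: person -> household -> home location
    ?person    ex:has_household  ?household .
    ?household ex:has_location   ?homeLocation .
    # Activity schedule of the infected person
    ?person    ex:has_activity   ?activity .
    ?activity  ex:has_type       ?activityType ;
               ex:has_start_time ?startTime .
}
"""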

4.2 Benchmark Queries

It is important to compare the performance of our framework with the current state of the art. However, no SPARQL benchmark queries exist for the epidemiology domain. Therefore, we created a set of benchmark queries based on the Berlin SPARQL Benchmark (BSBM) version 3.1 [91]. BSBM provides a suite of benchmark queries for comparing the performance of SPARQL queries and is built around an e-commerce use case. We investigated BSBM and found that each query follows certain standard properties; we used those properties to create a corresponding list of epidemiology benchmark queries (BQ) (Table 10). We implemented the BQs to demonstrate the effectiveness of our mapping techniques with different query patterns.

Table 10 The table presents the BSBM benchmark queries, their corresponding epidemiology queries, the 5W1H question categories, and the benchmark query properties employed to construct the epidemiology queries

D2RQ provides a set of benchmark queries for performance analysis [92]. We selected some D2RQ benchmark queries (DBQ) and executed them in our framework using Benchmarking D2RQ v0.2 (Table 11).

Table 11 In this table, we present the D2RQ benchmark queries used in the experimentation

5 Empirical Analysis of EpiK

As a proof of concept, we built a prototype implementation of EpiK. A set of computational experiments were undertaken to demonstrate the linked datasets and to compare the performance of the system as a function of various mappings and views.

5.1 Experimental Design

The overall design comprises the following variables: (i) the set of queries summarized in Tables 9, 10, and 11; (ii) two sets of RDF graphs, created using value-based and tuple-based mappings and summarized in Table 13; and (iii) virtual and materialized views of the underlying data. The choice of representation style (value- vs. tuple-based) combined with the implementation of the execution platform (relational engine-based (virtual) vs. RDF engine-based (materialized)) provides four possible design points for the proposed platform, each with different trade-offs. Our experiments evaluate these design choices using the set of queries discussed above.
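A minimal sketch of the measurement loop behind such an evaluation is given below: each query is run against an endpoint for each design point and wall-clock execution time is recorded. The endpoint URLs and the stand-in query are assumptions; the actual experiments used the NDSSL servers and the query bank of Section 4.

# Illustrative benchmark harness: time each query against each design point.
# Endpoint URLs and the stand-in query are placeholders.
import time
from SPARQLWrapper import SPARQLWrapper, JSON

endpoints = {
    "virtual-tuple":         "http://localhost:2020/tuple/sparql",  # D2RQ server
    "virtual-value":         "http://localhost:2020/value/sparql",  # D2RQ server
    "materialized-jena":     "http://localhost:3030/epik/sparql",   # Fuseki / Jena TDB
    "materialized-virtuoso": "http://localhost:8890/sparql",        # Virtuoso
}

queries = {"Q1": "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"}      # stand-in query bank

for qname, qtext in queries.items():
    for design_point, url in endpoints.items():
        client = SPARQLWrapper(url)
        client.setQuery(qtext)
        client.setReturnFormat(JSON)
        start = time.perf_counter()
        client.query().convert()                   # execute and fetch results
        elapsed = time.perf_counter() - start
        print(f"{qname} on {design_point}: {elapsed:.3f} s")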

5.2 Datasets

All our experiments are conducted using a case study carried out for the Seattle region. The basic goal of the study was to understand the efficacy of various pharmaceutical and non-pharmaceutical interventions before and during a pandemic influenza outbreak. SIBEL was used to carry out the computational counterfactual experiments. The data used in the study and produced as a result of it are stored in various places. As a first step, we convert all the data into relational format and store them in Oracle and Postgres databases; the resulting datasets are briefly summarized in Table 14.

5.3 Software and Hardware Used

We use D2RQ (relational database to RDF mapping) version 0.8.1 as the mapping language [62], Virtuoso Open-Source Edition 7.1.0 (RDF triplestore and SPARQL engine) [61], Apache Jena 2.11.2 (RDF triplestore and SPARQL engine) [60], SPARQL [93] for querying the RDF graph, Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64-bit Production (relational database) [94], and Postgres version 8.3.14 (relational database) [95]. SIBEL simulations were run on Shadowfax [96]. We performed the Postgres experiment on Shadowfax and the others on the Taos machine because of the software facilities available at NDSSL.

5.4 Creating the RDF Graphs

We create tuple-based and value-based mapping files using the D2RQ language. A mapping file is itself an RDF graph. D2RQ maps a table name to an RDFS class name and a column name to a property bridge. Most of the mapping generations take less than a minute, and the mapping files are relatively small, so they are easy to maintain. The mappings create linked and storage-agnostic access to the heterogeneous data. Table 12 reports the mapping file sizes, the number of triples in the mapping files, and the mapping generation times. Next, we apply the mapping files to our data sources to produce RDF graphs (Table 13). In a value-based RDF graph, some triples are longer than in a tuple-based RDF graph because of the primary key and foreign key link information; hence, a 10% increase in the number of triples causes a 20% increase in RDF graph size. After generation, the RDF graph is loaded into Virtuoso and Jena TDB to publish the data and execute queries. The mapping files create links between the various heterogeneous data sources (Table 14), thus solving our data linking problem.

Table 12 The table summarizes the tuple-based and value-based mapping file creation times, sizes, and triple counts
Table 13 Table summarizing information related to the RDF graphs created using the tuple-based and value-based approaches
Table 14 Information related to the epidemic experiments carried out using Seattle synthetic data

5.5 Results

We used SPARQL to implement all the queries listed in Tables 9, 10, and 11. For example, the SPARQL query shown in Table 15 retrieves information on the number of preschool children, a data type used in the Seattle case study. We then executed the queries to measure the performance of the mapping approaches over the various types of RDF graphs. Figures 8 and 9 show the query execution performance of our mapping approaches over D2RQ with the Oracle database, D2RQ with the Postgres database, Jena TDB, and Virtuoso, respectively.

Table 15 This table presents the SPARQL implementation of the query “How many preschool children’s records are used in the Seattle strong flu (a flu that infects 30% of the population) experiment?”
Fig. 8

This figure shows native query performance over virtual RDF graph. Here, we present the performance for both tuple-based and value-based mappings by employing Oracle and PostgreSQL databases

Fig. 9

This figure shows native query performance over materialized RDF graph. Here, we present the performance for both tuple-based and value-based mappings by using Jena TDB and Virtuoso triplestores

Observation 1:

For both virtual and materialized RDF graphs, value-based mapping performs better than tuple-based mapping on the native queries (except Q8). Figure 8 shows that the “Find infector and infectee information for a particular demographic” query (Q8) takes longer to run on PostgreSQL because it returns many result triples and its performance depends on the available system RAM.

Observation 2:

Performance results for benchmark queries are shown in Figs. 10a, b and 11a, b. It can be seen that tuple-based mapping with the Virtuoso tool performs better in this case.

Observation 3:

Our experimental results show that a SPARQL query containing a regular expression (BQ3) performs faster with tuple-based mapping and Oracle for the virtual RDF graph. On the other hand, for the materialized graph, value-based mapping is faster (for BQ3) and provides the same performance for both Jena TDB and Virtuoso.

Fig. 10

BSBM like epidemiology query performance over virtual and materialized RDF graphs

Fig. 11

D2RQ benchmark query performance over virtual and materialized RDF graphs

The above observations show that queries that need multiple data sources to retrieve results perform better with the value-based approach, while queries that need a single data source have approximately the same execution time regardless of which mapping is used. This is because tuple-based mapping ignores primary key and foreign key relationships and converts all relational database values to literals, which duplicates data in the RDF graph; moreover, no explicit link between data entities exists in tuple-based mapping. Hence, complex queries that need numerous data sources take longer to execute on RDF graphs constructed with the tuple-based mapping approach. Value-based mapping, in contrast, preserves the primary key and foreign key relationships that link multiple data sources, so complex queries that join multiple datasets execute faster on an RDF graph created with the value-based mapping approach. Accordingly, most of the queries in Table 9 execute faster on an RDF graph created with value-based mapping. Most of the benchmark queries in Tables 10 and 11, however, do not need multiple data sources to retrieve their results; hence their performance over RDF graphs constructed with tuple-based and value-based mappings is relatively similar. Note that the indexing algorithms of the triplestores and the relational databases have a significant impact on query performance. We use two triplestores (Virtuoso and Jena TDB) and two relational databases (Oracle and PostgreSQL) in our experiments, and we find that query execution time depends on the internal indexing algorithms used by these systems. Moreover, query performance depends on the hardware configuration as well. These aspects will need to be taken into account before deploying the system as part of a production system.

6 Analytics with EpiK

We briefly describe the kinds of analytics that can be done using EpiK. The goal is not to be exhaustive but to convey the utility of developing an environment such as EpiK. We describe three simple examples, chosen (i) to illustrate the value of a federated data representation, in which insights are obtained by reasoning over multiple datasets simultaneously, (ii) to illustrate the use of SPARQL and RDF as the query language and data representation, which allows us to develop network queries that are pertinent in epidemic analysis, and (iii) to illustrate “realistic” studies that can be undertaken in agent-based epidemic analysis. All the queries arising in the examples below are executed using EpiK. In our experimentation, we use a materialized RDF graph generated through the value-based mapping approach; the graph is constructed from the various datasets used and generated by a one-cell, ten-replicate influenza simulation study.

Example 1: Understanding the Structure of the Dendograms

A basic task in epidemic analysis is to understand the structure of the dendogram. The dendogram is a directed acyclic graph (DAG), and one can also perform basic analyses that involve the demographics of the infected individuals. We analyze ten dendograms coming from ten replicates and compute basic properties of each dendogram: the degree, path, and star distributions. We count only distinct star patterns and discard duplicates; for example, a five-node star is not counted within a six-node star. Duplicates are, however, counted for paths. In our experiments, the minimum star size is four nodes. The degree, star, and path distributions of the ten replicates are shown in Figs. 12, 13, and 14. Figure 12 shows that there are few high-degree nodes and many low-degree nodes. Similarly, Figs. 13 and 14 show a small number of long paths (or large stars) and many short paths (or small stars). Moreover, Figs. 12 and 14 show that the replicates are relatively similar with respect to the degree and star distributions, and that these distributions have long tails. This is not the case for the path distribution: Fig. 13 shows that the replicates vary and do not exhibit the long tail seen in the degree and star distributions. In Tables 16, 17, and 18, we provide detailed degree, path, and star counts. Summary statistics across all the replicates are presented in Table 19.
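A sketch of how such structural statistics might be computed for a single dendogram is given below, assuming its who-infected-whom edges have already been retrieved from EpiK; networkx handles the graph, and the edge list is a toy example.

# Sketch: basic structural statistics of one dendogram (a forest in which
# each infectee has a unique infector), given its edges as (infector, infectee)
# pairs. The edge list below is a toy example.
from collections import Counter
import networkx as nx

edges = [("p1", "p2"), ("p1", "p3"), ("p1", "p4"), ("p1", "p5"),
         ("p2", "p6"), ("p6", "p7"), ("p7", "p8")]
dag = nx.DiGraph(edges)

# Degree distribution: out-degree is the number of direct infections per node.
degree_counts = Counter(d for _, d in dag.out_degree())
print("out-degree distribution:", dict(degree_counts))

# Star patterns: a node with out-degree >= 3 plus its children forms a star
# of at least four nodes, the minimum star size used in this section.
stars = {v: d + 1 for v, d in dag.out_degree() if d >= 3}
print("stars (root -> size):", stars)  # {'p1': 5}

# Longest transmission chain from each initially infected root. Because the
# dendogram is a forest, the shortest path from a root is also its only path.
roots = [v for v in dag if dag.in_degree(v) == 0]
for root in roots:
    lengths = nx.single_source_shortest_path_length(dag, root)
    print(root, "longest chain:", max(lengths.values()))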

Fig. 12

This figure shows the degree distribution for a ten-replicate, one-cell experiment. For each replicate, node degree is on the x-axis and the fraction of nodes on the y-axis. The figure shows that the degree distributions of the ten replicates are similar: there are few high-degree nodes and many low-degree nodes

Fig. 13

This figure provides path distribution information. We count the number of occurrences of the various length of paths in the dendograms. Plots show the path length distribution across replicates is similar

Fig. 14

This figure represents the distribution of the star patterns in the dendograms. Star size (number of nodes in a star) is on the x-axis and the fraction of total stars on the y-axis

Table 16 In this table, we provide degree count information
Table 17 The table presents path count statistics, including the maximum, minimum, mean, and standard deviation of path lengths for all the replicates
Table 18 The table shows star pattern count information
Table 19 Summary of degree, star, and path counts

Example 2: Understanding Chains of Transmissions in Epidemics

During the course of H5N1 planning efforts, several groups, including ours, were interested in understanding long chains of epidemic transmission. The basic idea builds on Example 1 and tries to understand super-spreaders. In the epidemiology discipline, it is important to identify super-spreaders because, for many diseases, the distribution of infection rates suggests that 20% of the population causes 80% of the disease spread [97]. Hence, super-spreader information can help policy makers design precise interventions that effectively diminish the epidemic.

Intuitively, a node (individual) is a super-spreader if the node infects many other nodes. The basic issue of interest is direct versus indirect infections. A direct super-spreader is a node that directly infects a large number of individuals. An indirect super-spreader is a node that is the root of a long chain of infections. Intuitively, these two transmission pathways are different: the first resembles a star, while the latter is a long path in the dendogram. Understanding the distribution of paths and stars in a dendogram thus provides us with an understanding of the transmission structure. An analysis across multiple dendograms tells us how robust the conclusions are within a single cell of the experiment (and thus provides a basis for sensitivity analysis). We present path and star distributions in Figs. 13 and 14 (count information in Tables 17 and 18). The figures clearly show a small number of super-spreaders across the different replicates. Policy makers can use this super-spreader information to design targeted interventions (e.g., vaccinations) to stop disease propagation.
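Once the dendograms are represented in EpiK, the direct form of this question can be posed to the knowledge base itself. The sketch below shows one way such a query might look; the predicate names and the threshold of four direct infections are illustrative assumptions.

# Sketch: candidate direct super-spreaders, i.e., individuals who directly
# infect at least four others. Predicate names are placeholders.
query_direct_super_spreaders = """
PREFIX ex: <http://example.org/epik/>
SELECT ?infector (COUNT(?infectee) AS ?directInfections) WHERE {
    ?infection ex:has_infector ?infector ;
               ex:has_infectee ?infectee .
}
GROUP BY ?infector
HAVING (COUNT(?infectee) >= 4)
ORDER BY DESC(?directInfections)
"""

Indirect super-spreaders, i.e., roots of long transmission chains, can be found analogously with transitive property path queries of the kind sketched in Section 3.2.1.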

Example 3: Investigating the Role of School Children in Disease Transmission

As a final example, we study the role of school children in an epidemic. It is well accepted that school children play an important role in epidemics [98, 99]. EpiK can be used in a simulation-based analytical experiment to examine this. We look at the dendograms produced by the simulations discussed in the earlier experiments and consider two kinds of motifs: paths of various lengths whose starting node is a school-aged child (age range 3–18), and stars of various sizes whose root is a school-aged child. Our results are summarized in Figs. 15 and 16. In both plots, for a given size S, the y-axis shows the percentage corresponding to the fraction \(\frac {A}{B}\), where A is the number of paths (stars) of size S whose root is a school-aged child and B is the total number of paths (stars) of size S. The higher the fraction, the more prominent the role of school-aged children. The figures clearly show that school-aged children play an important role in epidemic spread. For example, Fig. 15 shows that for most path lengths, over 50% of the paths have a school-aged child as the root. The same conclusion holds for stars. Furthermore, for the largest stars (sizes 9, 10, and 21) and the longest path (size 27), 100% of the root nodes (which act as super-spreaders) are school-aged children.

Fig. 15

The figure shows, for various path lengths, the percentage of paths (over the total) whose root infector is a school-aged child. We plot path length on the x-axis and the percentage of paths with a school-aged child as the root infector on the y-axis

Fig. 16

In this figure, we show, for several star sizes, the percentage of star patterns (over the total) whose root node is an infected school-aged child. We plot star size on the x-axis and the percentage of stars with a school-aged child as the root infector on the y-axis. In this experiment, the minimum star size is four nodes

The choice of paths and stars is important. Stars represent a particular school-aged child infecting many other individuals in a small time frame; a path represents a school-aged child being the originator of a sustained transmission chain. Each motif points to a different role played by a school-aged child. The high fraction of such motifs with school-aged children as roots provides evidence that school children play a major role in disease transmission.

This example illustrates how simulations can be used to assess the role of school children in epidemic spread. To do this, we first look at paths in which a single school-aged child originates a chain of infections. Second, we look at stars, i.e., sets of edges in the dendogram in which the infector node is a school-aged child and the infectees are of any type. We compute the total number of occurrences of paths and stars of each size across all the replicates where the root node belongs to any demographic group (school age, young age, middle age, and senior age), and the corresponding counts where the root node is an infected school-aged child (age range 3–18). We then calculate, for each size, the percentage of path and star occurrences whose root node is an infected school-aged child over all the replicates. These percentages are shown in Figs. 15 and 16. Figure 15 shows that, for path lengths 1–27 over all the replicates, the originator of a path of infections is an infected school-aged child more than 50% of the time (except for path lengths 23, 24, and 26). Moreover, across all the replicates, the root node of the longest path (size 27), which acts as a super-spreader, is always an infected school-aged child. Similarly, Fig. 16 shows that, across all the replicates and for most star sizes, the root node of a star of infections is an infected school-aged child more than 50% of the time (except for star size 8). Furthermore, the root nodes of the largest stars (sizes 9, 10, and 21), which act as super-spreaders, are infected school-aged children in 100% of the cases over all the replicates. Hence, Figs. 15 and 16 support the conclusion that more than half of the infection originators, and a significant number of super-spreaders, are infected school-aged children.
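The aggregation itself is simple once the motifs have been extracted. The sketch below assumes the path and star motifs are available as (root, size) pairs pooled across replicates and that ages come from the synthetic population; the function and variable names are ours, introduced only for illustration.

from collections import Counter

SCHOOL_AGE_MIN, SCHOOL_AGE_MAX = 3, 18  # school-aged child, as in Example 3

def school_child_root_percentage(motifs, age_of):
    """motifs: iterable of (root, size) pairs for one motif type (paths or
    stars), pooled across all replicates; age_of: node -> age mapping.
    Returns, for each size, the percentage of motifs of that size whose root
    is a school-aged child (the A/B fraction of Example 3)."""
    total, school = Counter(), Counter()
    for root, size in motifs:
        total[size] += 1
        if SCHOOL_AGE_MIN <= age_of[root] <= SCHOOL_AGE_MAX:
            school[size] += 1
    return {s: 100.0 * school[s] / total[s] for s in sorted(total)}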

Graph pattern queries (e.g., paths and stars) are crucial for understanding disease propagation and for designing effective interventions. The SIBEL system can be enhanced by using EpiK and the query bank described above. Currently, the SIBEL backend organizes its data using an RDB. As part of future work, we will integrate EpiK with SIBEL. This will allow analysts to execute complex graph queries while avoiding expensive join operations, and to run path and network-motif style queries more efficiently. The queries described above capture a number of realistic scenarios that analysts using SIBEL have encountered.
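As an illustration of why such queries are more natural over a graph representation than over relational joins, the sketch below runs a SPARQL 1.1 property-path query with rdflib over a hypothetical RDF export of one dendogram. The epik: namespace, the age and infected predicates, and the file name are placeholders of our own, not EpiK's actual vocabulary.

from rdflib import Graph

g = Graph()
g.parse("dendogram_replicate_01.ttl", format="turtle")  # hypothetical export

# Find every individual reachable by one or more infection edges from a
# school-aged root: a transmission-chain query written as a property path
# instead of the repeated self-joins a relational schema would require.
QUERY = """
PREFIX epik: <http://example.org/epik#>
SELECT ?root ?descendant WHERE {
    ?root epik:age ?age .
    FILTER(?age >= 3 && ?age <= 18)
    ?root epik:infected+ ?descendant .
}
"""

for row in g.query(QUERY):
    print(row.root, row.descendant)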

7 Concluding Remarks

We described a knowledge base (EpiK) to store, organize, integrate, and retrieve the diverse and heterogeneous datasets that arise in agent-based epidemiological modeling and analytics of infectious diseases. Our results demonstrate that epidemic analysis and data management can benefit from a knowledge base such as EpiK. Semantic web technologies played an important role in the development of EpiK, providing a flexible mechanism for creating a federated data layer. EpiK is designed to accommodate the continued growth of output data. It allows programmatic data access and the execution of complex queries over datasets spanning the entire agent-based modeling workflow. The query bank provides examples of the kinds of analysis that can be done using EpiK; as users add new types of queries to the system, the task of an analyst becomes progressively easier. To the best of our knowledge, this is the first attempt to develop a benchmark suite in this area. Finally, by running various epidemiologically relevant queries with two types of data mapping techniques, we demonstrate the performance of various tools. The empirical results show that our proposed framework is capable of extracting complex query results from heterogeneous data sources, with performance comparable to state-of-the-art technologies.

We conclude with directions for future research. Our efforts have focused primarily on infectious disease epidemiology; extensions to chronic disease and environmental epidemiology would be interesting next steps. Extensions will also be needed when studying vector-borne and water-borne diseases, in order to represent vectors as well as climatic, hydrological, environmental, and ecological datasets. These extensions pose new challenges; we briefly discuss a few of them. Extending EpiK to study infectious diseases such as malaria and Zika requires data pertaining to vectors, weather, and the habitat of the vector. Weather, ecological, and land cover datasets come from various sources and would need to be mapped onto a common coordinate system. The World-pop and Malaria-Map projects [100,101,102] have begun important work in this direction. Coupling these efforts with modeling environments and analytics such as SIBEL will be useful in the design and analysis of public policies.