1 Introduction

Timely decision-making for the efficient allocation of scarce healthcare resources during national or worldwide public health emergencies is critical (Lee et al., 2009). Notably, in the case of novel contagious diseases with many unknowns, timely knowledge of potential demographic and chronic risk factors can significantly help medical experts design and optimize treatment and prevention protocols and bring the outbreak under control (Davazdahemami et al., 2022; Li et al., 2020).

In the case of such novel diseases, while traditional experimental clinical studies can identify the major risk factors over time, they are inefficient in several respects. First, most of these studies use small samples of patients and are typically limited to testing the validity of one or a few potential risk factors. Second, they usually focus on specific patient groups (e.g., diabetic patients, pregnant women, senior adults), which calls the generalizability of their findings into question. Third, such studies are more focused on descriptive aspects and less concerned with comparing the identified risk factors in terms of their relative importance. Fourth, given that their focus is typically limited to a few risk factors, the risk associated with the interaction of multiple comorbid conditions is neglected in their analyses. Lastly, the settings of these studies are exclusive to a certain disease; they do not lead to a comprehensive, generalizable framework applicable to future health emergencies.

The current study addresses these shortcomings by presenting a universally applicable Exploratory–Descriptive–Explanatory (EDE) framework that can be used as a decision-making tool for identifying important chronic risk factors early in the course of novel, complicated infectious diseases. The exploratory phase of our framework combines evolutionary heuristic search methods with predictive analytics techniques to perform feature selection and identify an optimal subset of the most relevant chronic and demographic risk factors associated with a given novel disease. In the descriptive phase, we use network and visual analytics tools to understand the patterns and relationships among the risk factors identified in the exploratory phase. In the explanatory phase, the selected features are used to develop a Bayesian Belief Network through which we assess the probability of survival for individual patients given their specific combination of demographics and chronic conditions. This is particularly crucial given the limited resources at the time of outbreaks and the importance of accurately identifying high-risk patients so that healthcare resources can be allocated optimally. Additionally, this article introduces a decision support simulator, a web-based Bayesian inference tool, to assist decision-makers in exploring multidimensional interactions. The publicly accessible decision-support tool analyzes the effect of a specific combination of conditions based on patient-specific values and updates.

We use the COVID-19 pandemic as a case study to validate our proposed framework, drawing on a dataset of early COVID-19 cases in the United States. Despite using COVID-19 data to showcase the proposed framework, we argue that the framework is disease-agnostic and is meant to work for any similar health crisis.

The remainder of this article is organized as follows. In Sect. 2, the related studies from the literature are reviewed. Next, we describe the proposed framework and the specific data used to validate it. Following that, the feature selection and the Bayesian analyses are demonstrated. Finally, we discuss our findings and limitations, as well as future research directions.

2 Background

With the onset of the COVID-19 pandemic in early 2020, many research efforts began worldwide to understand the mechanism and risk factors of the novel, mysterious virus and, consequently, to design and adjust prevention and treatment protocols. While several key risk factors of the disease were identified and reported through these studies, they generally suffer from one (or both) of two main limitations.

First, they are primarily focused on a single risk factor or a few risk factors and do not account for the possible complications caused by the interaction of multiple health factors on the severity of the outcome. Such studies typically rely on standard statistical tests to detect a statistically significant difference between two samples in terms of a given potential risk factor. For example, Peckham et al. (2020), using a sample of more than 3 million global COVID-19 cases, showed that the odds of intensive care need and death in male patients are significantly higher than in females. Similar results were reported by Galbadage et al. (2020) in another meta-analysis of the effect of gender on the severity of the disease. Using a similar approach, other studies identified several other risk factors for the disease, including age, obesity, hypertension, cardiovascular diseases, diabetes, pulmonary diseases, and cancer (Du et al., 2021; Földi et al., 2020; Hussain et al., 2020; Javanmardi et al., 2020; Mantovani et al., 2020; Parohan et al., 2020; Romero Starke et al., 2020). The risk factors investigated in these studies were all considered individually, in isolation from other possible confounders (e.g., demographics, comorbidities, medications, or public health factors).

Second, even those studies that account for interrelationships among various factors are mostly focused on employing such patterns to improve the predictability of the outcome (e.g., patients' survival, hospital length of stay, or intensive care need) and are less concerned with explaining those relationships and their business and/or medical implications. For instance, Bekker et al. (2022) propose a new approach based on linear programming for predicting COVID-19 occupancy using only aggregated data as opposed to individual health records. In another predictive analytics study, Nikolopoulos et al. (2021) provide tools for forecasting country-level COVID-19 growth rates to plan supply chain demand for related products and services. Similarly, Eryarsoy et al. (2020) propose a novel approach (based on well-established marketing diffusion models) for estimating the number of cases, hospitalizations, ICU admissions, and fatalities with minimal data requirements, for use in public health policy development at the early stages of pandemics, when not enough data has yet accumulated to estimate the common epidemiology indices. Several other studies have employed various machine learning, time series, and simulation techniques to predict hospitalization demand (Alakus & Turkoglu, 2020; Deschepper et al., 2021; Huang et al., 2020), mortality (Cui et al., 2021; Davazdahemami et al., 2022; Taylor & Taylor, 2021), and vaccination supply chain requirements (Currie et al., 2020; Nagurney, 2021; Sinha et al., 2021).

However, these shortcomings are not limited to studies of the recent pandemic. A review of research conducted on three other major pandemics and epidemics of the past decade, namely the H1N1 swine flu pandemic in 2009, the Ebola virus disease epidemic in 2014, and the Zika virus disease epidemic in 2015, confirms that those studies were also mainly focused either on investigating single risk factors or on interrelations of risk factors with an emphasis on improving the predictability of outcomes. In a study of the risk factors of the H1N1 flu pandemic, O’Riordan et al. (2010) compared the charts of 58 children with H1N1 with those of children with seasonal influenza A using standard statistical tests and showed that factors such as age and underlying conditions differed significantly between the two groups. Similarly, using a case–control approach, Ribeiro et al. (2015) showed that age, obesity, and immunosuppression increase the odds of death due to H1N1 flu. Several other studies have employed similar bivariate statistical approaches to examine the risk factors of that disease (Baumeister et al., 2003; Campbell et al., 2010; Gilca et al., 2011; Hanslik et al., 2010; Lenzi et al., 2012) as well as Ebola virus disease (Hartley et al., 2017; Wing et al., 2018; Wirth et al., 2016; Xu et al., 2016) and Zika virus disease (de Araújo et al., 2018; Goodman et al., 2016; Smith & Mackenzie, 2016; Ventura et al., 2016).

In the predictive analytics realm, too, several studies have sought to perform early detection of these diseases (Akhtar et al., 2019; Kakulapati et al., 2021; Mahalakshmi & Suseendran, 2019; Thakkar et al., 2010), predict mortality or severity (Colubri et al., 2019; Pandey & Subbiah, 2018), and model the spread of diseases (Jiang et al., 2018; Zhang et al., 2015). The number of data-driven machine learning studies for past pandemics/epidemics is considerably smaller than for COVID-19, most probably because of those outbreaks' relatively limited scale and the advances in data collection and analysis infrastructures and methods in recent years. Nevertheless, what seems to be missing in the literature on data-driven studies of pandemics is a comprehensive methodology for early detection of chronic risk factors that considers the confounding roles of comorbid conditions in estimating the severity and mortality of novel viral diseases. Bayesian networks (BNs) are advantageous because they provide researchers with a causally correct method for exploring the domain with input from subject matter experts and for disentangling statistical correlation from causal effects (Pearl, 2009, 2014). BNs have gained popularity among machine learning models because their use of graph and probability theory principles enables non-technical subject matter experts to analyze big data (Topuz et al., 2021b).

The present study contributes to the extant literature by proposing a holistic framework that both identifies chronic comorbidity risk factors (i.e., those increasing the odds of mortality) through an evolutionary search process and explains the relative risk of those factors, individually or in combination, by creating BNs and presenting them via a publicly available web simulator. BNs actively use graph and probability theory concepts, making data science intuitive for non-technical subject matter experts. To be realistic and applicable, our framework is designed with two key assumptions in mind, namely: (1) the novelty of the disease and the absence of prior knowledge about its risk factors; and (2) the limited data availability in the early stages of an epidemic/pandemic and the need for quick and accurate decision support tools for planning limited resources and adjusting treatment protocols.

3 Methods and materials

3.1 Data

We used early US pandemic data from the Cerner HealthFacts data warehouse, one of the most comprehensive EHR platforms in the United States, with clinical records from more than 63 million unique patients collected since 2000. The data warehouse structure involves a fact table containing general visit (a.k.a. encounter) and patient features, with several dimensions containing details on diagnoses, medications, lab tests, medical facilities, and operations, to name a few. The COVID-19 related visits (both inpatient and emergency) between December 2019 and June 2020 were extracted. We excluded visits by patients under 18 years of age. The patients in the sample had all been diagnosed with COVID-19 using one or more diagnostic lab tests.

Next, we performed one-hot encoding of the patients' historical chronic conditions (the corresponding ICD-10 identifier for each condition was used for consistency). Chronic conditions were identified using the Chronic Condition Indicator (CCI) list for ICD-10-CM provided by the Agency for Healthcare Research and Quality. Acute conditions were excluded from the scope of this study since they could result from COVID-19 (i.e., as a consequence or symptom) rather than being risk factors. This resulted in 1047 binary features, each indicating the existence of a chronic condition in a patient's health records. Additionally, the final dataset includes patients' demographics and encounter-specific data (i.e., patient age at the encounter, admission type, and payer type) to be examined for their prospective risk level. We also included two numeric features showing the total number of diagnoses at admission and the total number of unique historical diagnoses in each patient’s records as two proxies of their general health condition. Table 1 shows a summary of the features extracted from the patients’ records by category.

Table 1 Summary of the predictive features
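To make the encoding step concrete, the following is a minimal pandas sketch of how the binary condition flags could be built. The long-format layout, the column names (patient_id, icd10), and the three-code stand-in for the AHRQ CCI list are hypothetical illustrations, not the actual HealthFacts schema.

```python
import pandas as pd

# Hypothetical long-format diagnosis history: one row per (patient, ICD-10 code).
diagnoses = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "icd10": ["E11.9", "I10", "I10", "F03.9"],
})

# Keep only chronic codes (per the AHRQ CCI list; a tiny stand-in set here).
chronic_codes = {"E11.9", "I10", "F03.9"}
chronic = diagnoses[diagnoses["icd10"].isin(chronic_codes)]

# One-hot encode: one binary flag per chronic condition in a patient's history.
flags = (
    pd.crosstab(chronic["patient_id"], chronic["icd10"])
      .clip(upper=1)          # presence/absence, not diagnosis counts
      .add_prefix("hx_")      # e.g., hx_I10 = history of essential hypertension
)
print(flags)
```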

Finally, a binary target variable was derived to indicate whether a given patient survived COVID-19 (survived = 0; deceased = 1). Our data revealed a 9.6 percent mortality rate (982 out of 10,189) for COVID-19, which is significantly greater than the average rate (2–3 percent) reported by the Johns Hopkins Coronavirus Resource Center (2021). We believe that although this discrepancy may be partially explained by the absence of asymptomatic patients (or those who self-quarantined at home) in our data, it may also be explained by the disease's novelty and the insufficient resources to control the epidemic during the early phases of the pandemic. Table 2 summarizes the demographics of the patients. Also, Fig. 1 contrasts the proportions of different conditions (by category) between the two classes of patients.

Table 2 Summary of the patients’ demographics
Fig. 1 Proportions of comorbidities by patient outcome

3.2 Methodology

Exploratory, descriptive, and explanatory research are incorporated into the proposed system's conceptual design, using a genetic algorithm and a Bayesian belief network. The framework for comorbidity analysis is developed in seven steps across three phases: (1) combining datasets from multiple data sources; (2) dealing with data issues such as missing values and multicollinearity; (3) feature selection with a genetic algorithm; (4) understanding patterns and interrelationships and representing the data graphically; (5) developing a probabilistic model to estimate the joint probability of survival; (6) designing a publicly available decision inference tool to perform what-if scenarios; and (7) interpreting explanatory results and updating the knowledge from the exploratory phase. Figure 2 illustrates the proposed methodology graphically. The following sections detail each of these phases.

Fig. 2 Graphical representation of the methodology

3.3 Exploratory phase: feature selection

Considering the primary goal of the proposed framework (i.e., risk identification early in epidemics or pandemics), we assumed that none of the COVID-19 risk factors were known before conducting this study. In other words, we initially treated each chronic condition as an equally important potential risk factor contributing to the likelihood of death by COVID-19, resulting in a high-dimensional dataset (a binary flag feature representing each condition). Such a large number of features is a well-known challenge in bioinformatics studies (Alexe et al., 2006). The literature shows that evolutionary search algorithms are efficient approaches for finding a near-optimal, reasonably small subset of features with minimal harm to the distinctive power and informativeness of the data (Fan & Chaovalitwongse, 2010; Mehmanchi et al., 2021; Şeref et al., 2018).

In line with the literature, to address the high dimensionality of our dataset, we employed an evolutionary heuristic search approach, the Genetic Algorithm (GA) (Holland, 1992), to find a near-optimal subset of variables. GA represents each potential solution to an optimization problem (here, a set of features) as a chromosome-like string; it then uses a set of rules to randomly generate a population of feasible chromosomes. Over several generations (iterations), the top chromosomes of each population (by fitness function score) are carried into the next generation, and new chromosomes replace the rest. New child chromosomes are generated partly at random and partly by mating the top parent chromosomes from the previous generation (i.e., crossing over two chromosome strings or mutating a single chromosome). Additional generations are produced until the variation in the best fitness value is minimal across several successive generations (i.e., the algorithm converges).

A GA forward-selection approach was used with a population size of 1000 solutions per generation to refine the feature set. The data was split chronologically: records of COVID-19 patients identified before April 22, 2020 (around 79.5% of the sample) were used for training, and the rest for validation. Each solution in the GA's initial population contained an arbitrary subset of 50 of the 1047 chronic conditions in the dataset. The chromosome length was determined by running the algorithm multiple times with different lengths (between 30 and 200) and observing no major improvement beyond 50 features. In each generation, the features selected by the algorithm were used to train a random forest (RF) classification model with default settings on the training subset, and we recorded the model's performance on the validation subset. RF was chosen for feature selection because the data contains many binary features; prior research has shown that tree-based methods are powerful algorithms in the presence of many binary categorical features (Davazdahemami et al., 2022).

The area under the receiver operating characteristic curve (AUC) was employed as the fitness function for selecting the top feature sets in each generation. AUC was chosen (among other evaluation metrics) because it reflects each selected feature set's capability to distinguish the two classes of patients. We used a tournament strategy as the algorithm's selection process; the survival, elitism, crossover, and mutation rates were set to 20%, 30%, 30%, and 10%, respectively. We let the algorithm run for a maximum of 100 generations, with early stopping upon convergence (no improvement over 10 consecutive iterations). Finally, the feature set with the greatest AUC was retained for the descriptive analysis (phase 2) and, subsequently, for developing the Bayesian Belief Network model that explains the relative individual and combined risk of each identified factor.

The GA procedure pursued for feature selection is summarized in the pseudo-code shown in Fig. 3.

Fig. 3 Summary of the feature selection GA algorithm
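As a complement to the pseudo-code in Fig. 3, the following is a compact Python sketch of the same GA loop with an RF/AUC fitness function. It is a hedged illustration rather than the authors' implementation: the population size, chromosome length, rates, and the synthetic demo data are scaled down so the script runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def fitness(cols, X_tr, y_tr, X_va, y_va):
    """Fitness = validation AUC of a basic random forest on the candidate subset."""
    rf = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=-1)
    rf.fit(X_tr[:, cols], y_tr)
    return roc_auc_score(y_va, rf.predict_proba(X_va[:, cols])[:, 1])

def ga_feature_select(X_tr, y_tr, X_va, y_va,
                      n_feat=10, pop_size=20, max_gens=15, patience=5):
    p = X_tr.shape[1]
    pop = [rng.choice(p, n_feat, replace=False) for _ in range(pop_size)]
    best, best_fit, stall = None, -np.inf, 0
    for _ in range(max_gens):
        scores = [fitness(c, X_tr, y_tr, X_va, y_va) for c in pop]
        order = np.argsort(scores)[::-1]
        if scores[order[0]] > best_fit:
            best, best_fit, stall = pop[order[0]], scores[order[0]], 0
        else:
            stall += 1
            if stall >= patience:                    # early stop on convergence
                break
        elite = [pop[i] for i in order[: max(2, pop_size // 5)]]  # survivors
        children = []
        while len(elite) + len(children) < pop_size:
            i, j = rng.choice(len(elite), 2, replace=False)
            pool = np.union1d(elite[i], elite[j])    # crossover: union of parents
            child = rng.choice(pool, n_feat, replace=False)
            if rng.random() < 0.1:                   # mutation: swap in a new gene
                newcomer = rng.integers(p)
                if newcomer not in child:
                    child[rng.integers(n_feat)] = newcomer
            children.append(child)
        pop = elite + children
    return best, best_fit

# Tiny synthetic demo: 200 binary "conditions", the first three truly predictive.
X = rng.integers(0, 2, (600, 200))
y = (X[:, :3].sum(axis=1) + rng.random(600) > 2).astype(int)
cols, auc = ga_feature_select(X[:450], y[:450], X[450:], y[450:])
print(sorted(cols), round(auc, 3))
```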

3.4 Descriptive phase: understanding patterns and inter-relationships

In the second phase of our framework, we take advantage of descriptive analytics tools to obtain initial insights into the patterns of relationship between the selected features and the target variable.

Specifically, we first look into the relative survival and death rates for patients suffering from each selected chronic condition. Given the feature selection criteria explained above, we expect to observe relatively high diagnostic power for each feature, reflected as a significant difference between its corresponding survival and death rates. Hence, in addition to yielding insights at the bivariate level, this step can be considered a validation mechanism for the feature selection.

Second, we use descriptive tools to investigate the conditions' pairwise comorbid relationships with survival. For each pair of selected chronic conditions, we examine the frequency of cases diagnosed with the novel disease as well as the relative death rate. This provides insight into the vulnerability of patients with a history of each comorbid pair, in terms of both contracting the disease and mortality.
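A minimal sketch of this pairwise computation is given below; flags is assumed to be the binary condition matrix from the data-preparation step and deceased the aligned 0/1 outcome (both hypothetical names).

```python
import itertools
import pandas as pd

def pairwise_mortality(flags: pd.DataFrame, deceased: pd.Series) -> pd.DataFrame:
    """Case counts and death rates for every pair of selected chronic conditions."""
    rows = []
    for a, b in itertools.combinations(flags.columns, 2):
        both = (flags[a] == 1) & (flags[b] == 1)     # patients with both conditions
        if both.any():
            rows.append({"pair": f"{a} & {b}",
                         "n_cases": int(both.sum()),
                         "death_rate": float(deceased[both].mean())})
    return pd.DataFrame(rows).sort_values("death_rate", ascending=False)
```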

3.5 Explanatory phase: probabilistic prediction model and inference simulator

3.5.1 Probabilistic models and Bayesian belief networks

Since their origin (Wright, 1934), probabilistic graphical models built on DAGs (directed acyclic graphs) have come a long way, and their variations have found use in a wide variety of fields. Bayesian Networks (BNs) are used to solve various problems in machine learning and cognitive science. While BNs may be used to investigate the domain in a causally accurate manner, they can also be used for explanatory purposes, disentangling statistical correlation from causal effects. Essentially, the BN model is a directed acyclic graph (DAG). The nodes correspond to the relevant variables (e.g., whether the patient has dementia with behavioral disturbance and/or unspecified atrial fibrillation). The conditional dependence between variables is symbolized by arcs (Pearl, 1988). The arc directions illustrate the dynamics between a parent and a child: an arrow connecting A and B, where A is the parent node of B, indicates that the probabilities of B are conditionally tied to the values of A. Another valuable way to conceptualize the Bayesian model is through the lens of hypothetical events: each arc represents the statement that, if we could wiggle A, we would expect to notice a change in the probability of B. Assuming no other variables exist, Fig. 4 illustrates a simple DAG reflecting the relationship between \(x_1\): the patient's age, \(x_2\): their status of dementia with behavioral disturbance, and \(x_3\): their status of unspecified atrial fibrillation. Because age is a common cause (confounding variable) of both dementia with behavioral disturbance and unspecified atrial fibrillation, individuals with dementia with behavioral disturbance often also have unspecified atrial fibrillation. However, the association is not causal: a patient's dementia with behavioral disturbance does not in itself produce unspecified atrial fibrillation; instead, both conditions are explained by a third factor, the patient's age.

Fig. 4 Example of a DAG diagram

We are interested in two types of probabilities. The marginal probability, for a node that has no parents, is the probability of an event occurring regardless of the outcome of any other variable. The conditional probability, for a node that has parent(s), is the probability of one event occurring in the presence of another. Conditional probability tables (CPTs) link parent and child node states and have entries for all possible compositions of child and parent node states. Following the conditional dependencies among these variables, a BN offers a simple way to assess the joint distribution of a set of random variables; hence, one can calculate the conditional probabilities of any subset of factors. Furthermore, the graphical aspect of Bayesian networks is helpful because it gives users a visual perspective on the subject at hand. To simplify complex joint probabilities, the BN chain rule is used (Koller & Friedman, 2009):

$$ P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Y_{x_i}), $$

where each \(x_i\) denotes a child variable and \(Y_{x_i}\) represents its parent(s). Figure 4 illustrates this in action: \(P(x_3 \mid x_2, x_1)\) is the probability of unspecified atrial fibrillation given the value of the patient's age (it is independent of dementia with behavioral disturbance when conditioning on age).
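As a worked illustration of the chain rule on the Fig. 4 DAG, the sketch below evaluates one joint probability from three small CPTs. All probability values are invented for illustration, not estimates from the study data.

```python
# CPTs for the Fig. 4 DAG: age (x1) -> dementia (x2), age (x1) -> afib (x3).
# All numbers are illustrative assumptions only.
P_age = {"<=65": 0.7, ">65": 0.3}                 # marginal: x1 has no parents
P_dem_given_age = {"<=65": 0.02, ">65": 0.15}     # P(x2 = 1 | x1)
P_afib_given_age = {"<=65": 0.03, ">65": 0.10}    # P(x3 = 1 | x1)

def joint(age, dementia, afib):
    """Chain rule: P(x1, x2, x3) = P(x1) * P(x2 | x1) * P(x3 | x1)."""
    p2 = P_dem_given_age[age] if dementia else 1 - P_dem_given_age[age]
    p3 = P_afib_given_age[age] if afib else 1 - P_afib_given_age[age]
    return P_age[age] * p2 * p3

print(joint(">65", True, True))   # 0.3 * 0.15 * 0.10 = 0.0045
```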

To develop a BBN model, there are two options: (1) a manual, more subjective approach based on expert opinion, and (2) a data-driven analytical approach that induces the structure using advanced mathematical models (Koller & Friedman, 2009). Since CPTs are calculated for all states of the parent nodes against those of a child node, their size increases exponentially. Consider a single child node that has m parents, where both the child and the parents have n states; the total number of entries in the CPT becomes \(n^{m + 1}\). For example, a simple CPT with m = 2 and n = 5 requires 125 entries. Thus, eliciting even a modest-sized network requires a set of domain experts to study for many hours, which would be expensive and time-consuming (Korb & Nicholson, 2010). Furthermore, expert reasoning may elicit biased and doubtful outcomes due to differing viewpoints.

Numerous earlier research works have demonstrated a variety of approaches for inducing the structure from data. The Naive Bayes method is a simple technique that applies the Bayes rule and requires computing the probability of the class/target for each input value. Naive Bayes restricts the structure with the assumption that all variables are independent. The TAN (tree-augmented naive Bayes) model, an upgraded version of naive Bayes, implements a tree-structured model (Friedman et al., 1997). The TAN model builds on the Naive Bayes model by adding a degree of interaction between the system's attributes: each attribute in the TAN model depends on its class and one additional variable from the variable set (see Fig. 4). Since this model integrates attribute dependencies, it is more realistic than a naive model and outperforms other data-driven structural learning techniques such as Naive Bayes and the Markov blanket (MB). Korb and Nicholson provide further examinations and specifics on comparing BN structures (naive Bayes, TAN, and MB) (Korb & Nicholson, 2010).

In our context, we evaluated several structural learning algorithms and chose the TAN model due to its superior performance. After the structure is constructed, the BN can act as an inference simulator, allowing all domain relations to be extracted via simulation. Using CPTs, reasoning remains effective despite multiple unknowns, and one can conduct omnidirectional inference using the BN decision support simulator. Eliciting inference from simulation inside a BN is time-consuming and resource-intensive (Topuz et al., 2018b). However, technologies capable of doing the necessary work in the background have been applied, and these algorithms can all be implemented successfully in the inference simulator. For this purpose, we present the findings using a web-based inference simulator (see the results and discussion section).
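To make the structure-learning step concrete, below is a minimal sketch using pgmpy's TreeSearch TAN learner, an open-source alternative to the toolchain behind the study's own results (the published simulator is built on BayesiaLab). The file path and the column names (covid_chronic_flags.csv, age_band, deceased) are hypothetical.

```python
import pandas as pd
from pgmpy.estimators import TreeSearch, BayesianEstimator
from pgmpy.models import BayesianNetwork

# Discrete patient table: binned age, binary condition flags, and the target.
df = pd.read_csv("covid_chronic_flags.csv")   # placeholder path

# Learn a TAN structure: every attribute gets the class node plus at most one
# other attribute as parents (Friedman et al., 1997).
dag = TreeSearch(df, root_node="age_band").estimate(
    estimator_type="tan", class_node="deceased"
)

# Fit the CPTs on the learned structure with Bayesian (BDeu) smoothing.
model = BayesianNetwork(dag.edges())
model.fit(df, estimator=BayesianEstimator, prior_type="BDeu")
```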

3.5.2 Understanding uncertainty in probabilistic modeling: mutual information and entropy concepts

Shannon’s mathematical theory of communication shows that entropy can be used to quantify information content (Shannon, 1948). In information theory, entropy provides a mathematical way of quantifying uncertainty. A greater entropy value suggests a greater degree of informational uncertainty or, more practically, a greater number of potential outputs of a function. In our case, we would want thorough information about a patient to generate a credible risk assessment. However, would we be completely unsure of the outcome if some patient-specific knowledge were missing, say, information about the heart condition? Even if we did not know anything about the patient's specific heart condition, we might have the patient's "age_at_encounter", "payer_information", and other chronic conditions, and the risk of death can be updated with that information. Knowing "age_at_encounter" leads to an information gain, and the average information gain reveals the predictive significance of observing these variables. We can calculate the target variable's conditional entropy given a predictive variable as follows:

$$ H(\text{Deceased} \mid \text{Age\_at\_encounter}) = \sum_{i} P(\text{Age\_at\_encounter}_i)\, H(\text{Deceased} \mid \text{Age\_at\_encounter}_i), $$

where H denotes entropy. The difference between the marginal entropy and the conditional entropy given the predictive variable is formally known as Mutual Information (MI). In our context, the MI is the marginal entropy of "Deceased" minus the conditional entropy of "Deceased" given "Age_at_encounter"; it can be formally defined as:

$$ MI(\text{Deceased}, \text{Age\_at\_encounter}) = H(\text{Deceased}) - H(\text{Deceased} \mid \text{Age\_at\_encounter}). $$

In general form, MI is:

$$ MI(A, B) = H(A) - H(A \mid B), $$

which translates as:

$$ MI(A, B) = \sum_{a \in A} P(a) \sum_{b \in B} P(b \mid a) \log_2 \frac{P(b \mid a)}{P(b)}. $$

The above enables us to measure the MI between any pair of variables. In other words, we can identify which variable generates the most information gain and thus has the most predictive significance.
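The following is a small self-contained sketch of this computation; the eight toy records are invented, and MI is reported in bits.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(A) in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_info(a, b):
    """MI(A, B) = H(A) - H(A | B), estimated from paired samples."""
    a, b = np.asarray(a), np.asarray(b)
    h_a = entropy(a)
    # Conditional entropy: weight H(A | B = v) by the empirical P(B = v).
    h_a_given_b = sum((b == v).mean() * entropy(a[b == v]) for v in np.unique(b))
    return h_a - h_a_given_b

# Toy example (made-up records): deceased flag vs. a coarse age band.
deceased = [0, 0, 0, 0, 1, 0, 1, 1]
age_band = ["<65", "<65", "<65", "<65", ">78", ">78", ">78", ">78"]
print(round(mutual_info(deceased, age_band), 3))   # ~0.549 bits of information gain
```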

3.5.3 Validation framework and performance evaluation

Several measures for the "true performance" of dependent-variable models (Sokolova & Lapalme, 2009) and for the "true understanding" of complicated relationships (Gonzalez-Lopez et al., 2019; Topuz et al., 2018a) have been developed in the literature. A full account of performance is essential, as is insight into the problem's complicated structure. Thus, this article offers (a) measures assessing overall performance (including AUC, accuracy, and dependability), and (b) measures illustrating interrelationships and conditional interdependence (reporting MI and entropy).

One of the most frequently used estimation techniques for classification-type data mining models is to split the data into training and testing sets, a process known as the training–testing split (Delen et al., 2020). Cross-validation and bootstrap resampling are two popular refinements of this approach (Benbasat & Nault, 1990; Delen, 2010). The stratified k-fold cross-validation procedure shuffles the entire dataset (all samples/rows) and then divides it into k distinct subsets, each with a nearly equal number of samples/rows and a roughly similar distribution of the dependent variable's classes (Fig. 5a). The model is trained on k-1 subsets and then evaluated on the remaining subset. This is repeated k times, with a different fold serving as the test set in each iteration and the remaining folds as the training set, until every fold has been used for testing. The classifier's overall performance is then computed as a simple average of the performance measurements obtained from the k individual test samples. A bootstrap sample is formed by resampling n cases with replacement from the original dataset, so duplicate instances may appear in a bootstrap sample (Fig. 5b). After n draws, the likelihood of any given case never being taken is \((1 - 1/n)^{n} \approx 1/e \approx 0.368\); hence, the expected number of unique cases from the initial dataset appearing in the bootstrap sample is about 0.632n, and the never-drawn cases can serve as the test set.

Fig. 5 a Stratified k-fold cross-validation. b Bootstrap method

Kim (2009) demonstrates that the bootstrap estimator suffers from bias when the sample size is small. To reduce the impact of the bias imposed by random sampling, k-fold cross-validation can be employed. However, Kohavi (1995) explains that when the sample size is not large enough, the k-fold cross-validation method suffers from variance. To present a complete picture of a credible estimate, this study includes both resampling procedures: stratified k-fold cross-validation and the bootstrap. A hedged sketch of both procedures is given below.
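This sketch illustrates both procedures under stated assumptions (a scikit-learn style classifier with predict_proba, NumPy arrays X and y); it is illustrative, not the study's evaluation harness.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier   # stand-in classifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def stratified_kfold_auc(model, X, y, k=10, seed=0):
    """Mean test AUC over k stratified folds (Fig. 5a)."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        aucs.append(roc_auc_score(y[test_idx],
                                  model.predict_proba(X[test_idx])[:, 1]))
    return float(np.mean(aucs))

def bootstrap_auc(model, X, y, rounds=10, seed=0):
    """Mean out-of-bag AUC over bootstrap resamples (Fig. 5b).

    Each round draws n cases with replacement; ~63.2% of distinct cases land
    in the training sample, and the never-drawn cases form the test set.
    """
    rng = np.random.default_rng(seed)
    n, aucs = len(y), []
    for _ in range(rounds):
        boot = rng.integers(0, n, n)                  # duplicates allowed
        oob = np.setdiff1d(np.arange(n), boot)        # never-drawn cases
        model.fit(X[boot], y[boot])
        aucs.append(roc_auc_score(y[oob], model.predict_proba(X[oob])[:, 1]))
    return float(np.mean(aucs))
```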

3.5.4 WebSimulator: inference generator

This study provides a web-based inference calculator (WebSimulator) that links data, analysis, and computing. Specifically, instead of making single-point predictions, we develop a BN model that explicitly accounts for uncertainty in all assumptions. This BN is the inference engine that powers the WebSimulator output, and the objective is to build a model from the data and predict the probability of COVID-19 death. The WebSimulator has grown into a decision aid that allows the user to access and share the data within the BN via visualization, modeling, and interactive analytical approaches that enhance interpretation and decision-making. By using the WebSimulator as a platform for broadcasting interactive models over the internet, this research allows end-users to engage with scenarios and better grasp the model's underlying dynamics, making it accessible to a broader audience. Importantly, any such "machine-learned solutions" may be readily replicated by stakeholders, who can use the WebSimulator to test different alternative assumptions and policy scenarios.

4 Results and discussion

4.1 Summary of exploratory analysis

4.1.1 Feature selection

The feature selection algorithm was executed on a PC with a 2.90 GHz processor and 64 GB of main memory. The model converged after 29 generations, which took roughly 4 h of processing time. Table 3 summarizes the 50 selected conditions by category (some conditions belonged to multiple categories).

Table 3 Selected chronic risk factors by category

Generally, we identified the same groups of risk factors as those noted in meta-analytic studies in the literature. Nevertheless, those studies were mostly conducted months after the onset of COVID-19 in the US. Moreover, we noticed some chronic diseases (categorized as “other”) that are typically prevalent among senior patients (e.g., arthritis, dementia, and cognitive impairment), suggesting that their inclusion among the features selected by the genetic algorithm could be attributed to patients’ age rather than the disease itself. This has been pointed out by MacLeod and Hunter (2021), who maintain that studies of COVID-19 must take its age-dependent effects into account.

4.2 Summary of descriptive analysis

First, we assess the bivariate relationship between each individual chronic condition and the COVID-19 survivability of the patients suffering from it. Figure 6 shows the proportions of surviving vs. deceased COVID-19 patients who suffered from each of the chronic conditions (ICD-10 codes) selected through the GA process.

Fig. 6 Survival and death rates in COVID-19 patients by their chronic conditions

As shown, a majority of the selected features differ remarkably in terms of the COVID-19 survival rate. This indicates the relatively high diagnostic power of the selected feature set, which, considering the AUC criterion (emphasizing the diagnostic power of the prediction model) used for the GA feature selection, suggests the efficacy of the feature selection approach.

At the individual condition level, the chart suggests that amyloidosis (ICD-10: E58.5), postprocedural hypopituitarism (ICD-10: E89.3), nonautoimmune hemolytic anemias (ICD-10: D59.4), and cardiac arrest (ICD-10: I46.9) are the most fatal risk factors.

While studying risk factors at the individual level can be insightful, many complications arise when two or more conditions develop at the same time. Figure 7 shows the COVID-19 mortality rate for the comorbid conditions obtained by pairing the selected chronic features. The heatmap suggests interesting potential complications at the pairwise level of comorbidity. For instance, whereas occlusion and stenosis of cerebral arteries (ICD-10: I66.0) do not appear diagnostic-aiding at the individual level (i.e., 50% survived, 50% deceased), at the pair level 100% of patients who had that condition along with vitamin D deficiency (ICD-10: E55.9) died. Similarly, while gout (ICD-10: M10.9) and hypertensive heart and renal disease (ICD-10: I13.0) individually involved mortality rates of 16.4% and 38.1%, respectively, patients who developed both conditions turned out to have a mortality rate of 83%. As another example, thrombocytopenia (ICD-10: D69.6) has an individual death rate of 25.6%, whereas its comorbidity with non-insulin-dependent diabetes mellitus (ICD-10: E11.1) or with hypertensive heart and renal disease (ICD-10: I13.0) increases the corresponding death rate to 100% and 78%, respectively.

Fig. 7 The mortality rate of COVID-19 patients by pairwise comorbid conditions

Whereas complications associated with pairwise comorbidities are insightful, in the next stage, employing the Bayesian network approach, we step up and look into more complicated confounding patterns resulting from multiple comorbid conditions.

4.3 Summary of explanatory analysis

BNs can be used in a wide range of real-world situations because they support analyses under a high degree of uncertainty. This study includes both resampling procedures (stratified k-fold cross-validation and the bootstrap) to present a comprehensive picture of a credible estimate. In other words, we produced twenty different probabilistic inference models: ten for k-fold and ten for bootstrapping. Table 4 shows the predictive performance averaged over the ten folds. Overall, the bootstrap results were better than the k-fold results (mean ROC: 90.8 percent vs. 85.8 percent).

Table 4 Cross-validated classification performance

Typically, in traditional statistical analysis, correlation and covariance are explored to understand the relative significance of variables, specifically with respect to the target variable. The current study offers an alternative technique, based on information theory (Ehsani et al., 2010; Topuz et al., 2021a), for determining how observing a variable reduces the uncertainty about the outcome probability of a novel disease (see Sect. 3.5). Here we employed the information gain concept, using entropy and MI, to measure uncertainty and determine which variable has the most predictive power. In the BN shown in Fig. 8, node size represents each variable's MI with the dependent variable, and the values on the arcs specify the MI between nodes.

Fig. 8 Bayesian network for risk factors (node volume symbolizes MI with the dependent variable; values on arcs specify MI between nodes)

The patient’s age (age_at_encounter: 0.078) is the most significant predictor, followed by the number of diagnoses the patient has (numDiagnosis: 0.056) and the type of the patient’s encounter (encounter_type: 0.052). Among the comorbidities, the patient’s cardiac arrest (unspecified) status has the most information gain (I46.9: 0.003), followed by the patient’s encephalopathy status (G93.4: 0.002) and unspecified dementia without behavioral disturbance (F03.9: 0.001). A comprehensive list of all comorbidities and factors, with their relative significance, can be found in Appendix A.

4.3.1 Omnidirectional inference with web simulator

We created an example using omnidirectional inference and ran several simulations on various network components to show how to interpret the conditional dependencies. Figure 9 depicts the marginal probabilities as horizontal bars, along with the information gained by observing the age variable (second row), cardiac arrest (I46.9; third row), and encephalopathy (G93.4; bottom row). Mortality has marginal probabilities of 90.5% (0 = not deceased) and 9.5% (1 = deceased). Updating the information on the age variable (age > 78) increases the probability of being deceased to 29.9%, compared to the marginal probability of 9.5%. Similarly, if we know the patient has had a cardiac arrest, the probability of being deceased increases to 84.7%. In the other direction, observing cardiac arrest shifts the mean of the age distribution from 54 to 70. Updating the belief on both age (> 78) and cardiac arrest (= 1) increases the probability of being deceased to 93%; in addition, the probability of encephalopathy increases by 15%. Indeed, the probabilities in all other components of the graphical model are updated as well (even those not shown in Fig. 9).

Fig. 9 Example of omnidirectional inference

Through the WebSimulator, a list of all interactions in the domain can be revealed. Most importantly, despite several unknowns, reasoning about the application domain can be done effectively (Topuz et al., 2018a, 2021). Since the model is published online, users may experiment with various scenarios, examine the model's dynamics, and explore omnidirectional inference. Through visualization, modeling, and analysis, the simulator has evolved into an instructional tool capable of efficiently extracting and disseminating the knowledge inside the PGM, acting as a gateway connecting human and artificial intelligence. To interact with the model and analyze risk factors for a novel contagious disease in a specific patient with known demographic and chronic factors, a healthcare provider can use the WebSimulator, an adapted version of the predictive analyzer, which is publicly accessible at https://simulator.bayesialab.com/#!simulator/59628240012.

5 Conclusion

Understanding the main chronic risk factors, which allows for more effective allocation of scarce resources and improved treatments, is among the top public health management priorities during the outbreak of a novel viral disease. Whereas clinical studies in a controlled environment are considered the most reliable technique for detecting such risk factors, their considerable time requirements may result in the loss of many lives, apart from the negative social and economic impacts that may arise from the long-term lockdowns required to bring an outbreak under control. During epidemics or pandemics, predictive and prescriptive analytics can help medical experts and policymakers identify which chronic risk factors are most important to monitor. This is critical both for adjusting proper preventive measures and for identifying patient groups at high risk of mortality to be prioritized for the limited available health resources.

The three-phase exploratory, descriptive, and explanatory (EDE) methodology proposed in this study combines various data analytics tools. Specifically, we used genetic algorithms, Bayesian networks, and innovative model interpretation approaches to create a complete EDE methodological approach that may help clinical decision-makers respond quickly to a pandemic. We showcased the effectiveness of the proposed approach by applying it to early COVID-19 data obtained from a large EHR repository in the United States.

Even though the risk factors discovered by our proposed technique have already been mentioned in past studies, what distinguishes the current study is the capacity of the proposed approach to acquire highly similar findings within a considerably shorter period. For example, while the earliest published clinical studies discussing the role of type II diabetes mellitus and thrombocytopenia in increasing the mortality odds of COVID-19 patients date back to May (Bloomgarden, 2020) and April of 2020 (Yang et al., 2020), respectively, we could not find studies mentioning the increased risk of mortality in patients having both conditions together published earlier than late July 2020 (Zhang et al., 2020). That is, using our proposed approach, the risk associated with that comorbidity combination could have been discovered around a month earlier than it was. Additionally, for some comorbid pairs of chronic conditions, like diabetes and mitral insufficiency, we could not find any studies indicating an increased mortality rate associated with them in COVID-19 patients even by mid-2022, although both had been identified as independent risk factors by around mid-June 2020.

These timelines could be shortened even further if the international institutions responsible for global health policies developed proper monitoring and information governance techniques for future pandemics and effectively used social networks to crowdsource potential risk factors and symptoms from the general population (Zolbanin et al., 2021). For example, while we were able to reliably establish the risk variables using data from early US cases, highly comparable results could have been acquired sooner had there been an efficient international health data exchange infrastructure allowing a faster flow of valid information. There are two primary ways in which we believe earlier detection of risk variables would have resulted in lower total mortality. First, more accurate information could have been supplied earlier in the epidemic to high-risk patients. Second, better outcomes might have been achieved by more efficiently prioritizing and administering scarce healthcare resources. The majority of COVID-19 chronic risk factors could have been spotted as early as the third week of March 2020 (taking the chronologically earliest 80% of the COVID-19 cases in our data for training the prediction model), about the time the World Health Organization declared the infection a global pandemic. That is significantly earlier than when these risk factors were identified and reported in peer-reviewed journals. Studies such as this one show how machine learning methods can save time in looking for early indicators of illness; we think major health agencies around the world should collaborate to establish the infrastructure and skills needed to perform such data analysis early in future similar situations.

In fact, the primary contribution of this study is proposing a framework to discover risk factors of novel diseases (and the role of comorbid risk factors) while requiring minimal prior knowledge about those diseases. Even though most (though not all) of the risk factors were discovered within 6–12 months after the onset of the pandemic, we argue that, using our proposed framework, they could have been discovered several months earlier. This could have saved thousands of lives across the world simply by enabling more informed medical decisions. Particularly given the limited resources during outbreaks, a more accurate list of risk factors (and their potential comorbidity interrelationships) can considerably help healthcare authorities adjust their resource allocation priorities accordingly.

While our study sheds light on the benefits of machine learning models in the early detection of chronic risk factors, it has several limitations and thus opens opportunities for future research. For instance, the use of data obtained from a single source may raise concerns about the generalizability of the findings. Hence, we encourage future researchers to validate our proposed framework using data from other EHR repositories, possibly in combination, and to perform a comparative assessment of the risk factors identified thereby.