Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

System identifiability in a time-evolving agent-based model

Abstract

Mathematical models are a valuable tool for studying and predicting the spread of infectious agents. The accuracy of model simulations and predictions invariably depends on the specification of model parameters. Estimation of these parameters is therefore extremely important; however, while some parameters can be derived from observational studies, the values of others are difficult to measure. Instead, models can be coupled with inference algorithms (i.e., data assimilation methods, or statistical filters), which fit model simulations to existing observations and estimate unobserved model state variables and parameters. Ideally, these inference algorithms should find the best fitting solution for a given model and set of observations; however, as those estimated quantities are unobserved, it is typically uncertain whether the correct parameters have been identified. Further, it is unclear what ‘correct’ really means for abstract parameters defined based on specific model forms. In this work, we explored the problem of non-identifiability in a stochastic system which, when overlooked, can significantly impede model prediction. We used a network, agent-based model to simulate the transmission of Methicillin-resistant staphylococcus aureus (MRSA) within hospital settings and attempted to infer key model parameters using the Ensemble Adjustment Kalman Filter, an efficient Bayesian inference algorithm. We show that even though the inference method converged and that simulations using the estimated parameters produced an agreement with observations, the true parameters are not fully identifiable. While the model-inference system can exclude a substantial area of parameter space that is unlikely to contain the true parameters, the estimated parameter range still included multiple parameter combinations that can fit observations equally well. We show that analyzing synthetic trajectories can support or contradict claims of identifiability. While we perform this on a specific model system, this approach can be generalized for a variety of stochastic representations of partially observable systems. We also suggest data manipulations intended to improve identifiability that might be applicable in many systems of interest.

1. Introduction

Evaluation and optimization of novel intervention approaches to control infectious diseases in real world settings can be logistically challenging, expensive, and often ethically problematic. Mechanistic models of disease transmission provide an important tool supporting the fight against the spread of infectious diseases, as these systems offer an alternative in silica environment for exploring intervention options. It is thus unsurprising that mathematical models are in widespread use in epidemiological research. These models range from compartmental models using aggregated population states to describe transmission dynamics to high dimensional agent-based models [114] that attempt to represent the individual mixing structure of a population as it might influence the progression of an epidemic. However, models present their own challenges, as the data needed to constrain model simulations are often sparse and incomplete. For instance, observed numbers of individuals with a particular bacterium might correspond to only a small fraction of the true number colonized. In this circumstance, models must be able to reliably represent the hidden progression of an outbreak based on sparse observations.

A model can adequately predict the progression of an outbreak, i.e., disease transmission dynamics, even if mis-specified in some of its assumptions, provided it captures the relationship between observations and state variables of interest. However, the accuracy of those predictions is typically highly dependent on good estimation of model system parameters. Using parameters across different models (for example, using the transmission rate estimated in a previous modeling study) can be problematic because there may be significant differences of same parameters in different systems or under different model assumptions.

To find the parameters relevant for a given model, many researchers augment their models with data that are used to inform parameter selection [1520]. The data are often direct observations of the system state (diagnostic data) or data describing conditions influencing the system state (e.g., meteorological conditions or patient location data). Using these data, infectious disease model parameters that give the best agreement between the observed data and model predictions can be inferred with methods such as Bayesian filters [2123] or gradient descent methods [24, 25]. Underlying this parameter estimation is the issue of parameter identifiability, which, if not addressed properly can distort prediction and system certainty [26, 27]. In essence, identifiability poses the following question: assuming a model is correctly specified, and given the observations in hand, can the true parameters be reliably estimated? While the “true” parameters are typically not observed and remain unknown, system identifiability can be explored by creating synthetic observations using a given model and set of parameters, and then testing whether an inference method, equipped only with the synthetic observations, can consistently estimate the correct parameters.

The issue of identifiability has been explored in a number of fields [2830] including infectious disease modeling. For instance, Sauer et al. demonstrated that parameters of SEIR (susceptible-exposed-infected-recovered) epidemic models are intrinsically unidentifiable early in an epidemic and derived an unidentifiability manifold in the parameter space that consists of parameters indistinguishable from early epidemic curves [27]. Gallo et al. showed that lack of practical identifiability may hamper reliable predictions in compartmental COVID-19 epidemic models and developed a method to characterize different regimes of identifiability [26]. However, existing studies have predominantly focused on compartmental epidemic models described by ordinary differential equations. Identifiability for highly stochastic and high-dimensional agent-based models, which are increasingly used in real-world applications [3137], has not been systematically evaluated.

Indeed, for deterministic models, the question of identifiability is more straightforward and can be defined as either structural non-identifiability (where more than one set of parameters result in the exact same outcome) or practical non-identifiability (where lack of sufficient data or excess noise disrupt identification of most likely parameters). For such deterministic systems, there is a well-established literature on how to identify and, in certain circumstances, even fix issues of identifiability [38, 39]. However, many disease systems are stochastic in nature and better suited for stochastic simulation, often using agent-based models. A stochastic model will produce a distribution of possible trajectories with the same parameters and initial conditions; conversely, the same trajectory might result from a range of different sets of parameters. Such overlap of trajectories produced by different parameters imposes a further challenge when attempting to identify system parameters.

Here we explore the issue of identifiability using a stochastic, agent-based transmission model of Methicillin resistant Staphylococcus Aureus (MRSA), an antimicrobial resistant pathogen prevalent in community and healthcare settings, across 4 hospitals in New York City. Using location data of patients within hospital, we attempted to find the model parameters that best reproduce the number of observed positive MRSA cultures. Specifically, we used an Iterated Ensemble Adjustment Kalman Filter (EAKF) to infer two free parameters: the patient-to-patient transmission rate, β, and the probability a new patient is colonized, γ. We show that while using the EAKF on a specific observed trajectory gives a reliable answer (in the sense that it does not change between runs and appears to converge), using it on different synthetic trajectories derived from the same parameter combination yields different inferred parameters. In addition, we show that the likelihood of observing a particular trajectory given a set of parameters,

Pr(trajectory|β,γ), does not have a global maximum, but instead has a ridge of equally likely values, representing an interplay between the different colonization processes (importation vs. nosocomial transmission) that yields indistinguishable trajectories. We demonstrate the effect of this non-identifiability on the ability to predict hidden variables and assessment of the efficacy of interventions.

2. Methods

2.1 Data

Admission and discharge records for 278,522 distinct patients from 4 hospitals in New York City, New York, USA were used in this study. These data span 1399 consecutive days (02/01/2009 to 11/30/2012). For each patient, records include patient ID, location, admission, and discharge time. The records also include culture data from specimens obtained during hospitalizations, and whether MRSA, among other pathogens, was identified. Across the 4 hospitals, patients were distributed among 315 wards of varying size; ward size, a measure used in the dynamic model, was defined as the mean daily occupancy of the ward (ward size distribution followed a power law with mean 8.7 patients). Patients spent on average 6.3 days in hospital per visit. In total, 1845 positive MRSA cultures were recorded (excluding repeat positive tests obtained during the same hospitalization).

2.2 Model

We developed a network agent-based model in which each agent represents a patient admitted to the hospital, and interactions exist between patients in the same ward on the same day (Fig 1). The patient state can be either colonized or susceptible and transitions occur between the states in daily increments. Transition from colonized to susceptible (decolonization) occurs at a constant rate α[1/day]; however, transition from susceptible to colonized (colonization) occurs for susceptible patients co-located with a colonized patient at the density dependent rate β/Nw, where Nw is the ward size of ward w. We note as wid the ward where patient i was hospitalized on day d, and β[1/day] is the transmission rate. Each patient is already colonized upon admission with probability γ (importation rate). Model dynamics can be summarized by the equation indicating the probability that patient i is colonized on day d, Cid*∈ [0,1]: 0.1 Where Cid-1 is equal to 1 if patient i was colonized on the previous day, and 0 otherwise. hid is equal to 1 if patient i was admitted to the hospital for the first time on day d and 0 otherwise.

thumbnail
Fig 1. Schematic representation of the mechanistic model: A new patient introduced to hospital for the first time is colonized with probability γ.

A susceptible patient in contact with a colonized patient (i.e., in the same ward) has a probability of becoming colonized each day of contact, where is the number of patients in the ward where patient i stayed on day d. A colonized patient can decolonize and become susceptible each day (whether in hospital or not) at probability α. A colonized patient is observed (tests positive for MRSA) each day with probability ρ.

https://doi.org/10.1371/journal.pone.0290821.g001

If a patient is present in more than one location in the hospital network on a given day, he/she is assumed to have contributed to the transmission for each location in inverse proportion to the number of locations visited. That is, if we denote the number of locations patient i visited as kid, Eq 1 is rewritten as: 0.2 where wdi,m is the mth ward patient i visited on day d.

The colonization status of patient i on day d, Cid, is determined to be either 1 or 0 with probabilities Cid* and (1- Cid*) respectively.

Once the colonization status of each patient is determined, each colonized patient is observed (i.e. by positive test result) with probability ρ (See algorithm 1 in the S1 File for more detail).

2.3 Inference method

Our intention is to infer a set of parameters that, given our model, would be most likely to produce the time series of observed positive (colonized) patients. For that, we implemented an Iterated Ensemble Adjustment Kalman Filter (EAKF). In this method, we used the sample covariance between an ensemble of parameters and a corresponding ensemble of variables to adjust the distribution of parameters.

To perform this inference, we used an ensemble of model simulations. Each simulation was initiated with a set of parameters , as well as a corresponding ensemble of colonization states, and observation state , (for each parameter set–a corresponding colonization and observation state for all patients).

Using the model, we progressed the colonization state, , daily for each ensemble member (as described in algorithm 1 in the S1 File) over an assimilation period consisting of d days (Algorithm 2 in S1 File). During this integration, we accumulated the number of observed colonized patients, , for each ensemble member.

After each assimilation period, we used the Ensemble Adjusted Kalman Filter (Algorithm 3 in S1 File) to update the parameters, . Using the updated parameters, we repeated the process over the next time step until the end of our data. At each time step we recorded the values of all parameters sets in the ensemble and took the average over all time step for each ensemble member as the posterior distribution of the inferred parameters.

2.4 Likelihood estimation

To estimate the likelihood that a given set of parameters, θi, produced a given trajectory , we estimated the probability that a simulation with parameter set θi will produce trajectory Oj 0.3

‎We assumed all parameters in range are equally likely. That is, Pr(θi) is constant. For a given trajectory, therefore: 0.4

To estimate we can simply produce many trajectories using θi as parameters and find the ratio of trajectories that are identical to . However, due to the stochastic nature of our system the probability of getting exactly the same number of observations in all 99 assimilation periods is extremely small and would require an unreasonable number of trajectories to produce a robust estimation of the likelihood. Therefore, we assumed that the number of observations between different assimilation periods is uncorrelated and calculate the probability of a trajectory as the product of the probability of each assimilation period separately.

0.5

Even when estimating each assimilation period separately, a significant estimate of is computationally expensive and would give inflated weight to outliers. Assuming the value of is normally distributed (an assumption supported by simulations) we approximated the distribution of as a discretized normal distribution with the mean, and standard deviation, , from the ensemble of synthetic trajectories that were produced using θi.

2.5 Intervention simulation

In a non-identifiable model, the observed state can be adequately explained by different model parameters; however, this non-identifiability is only of practical importance if the different possible parameters predict a significant difference in quantities of interest. In our model, we show that using different parameters yields different hidden variables, like the total number of colonizations (Fig 4). A more salient issue is to explore the effect of a known change imposed on the system with multiple sets of possible parameters and whether those differences are associated with significant changes in observable states, even if the unperturbed system is indistinguishable. This has special relevance in public health where evaluating the effects of interventions is often the goal of a model. a goal.

To explore this issue, we imposed an intervention in the model that causes a proportional decrease in the rate of transmission between patients (i.e. β is reduced such that after the intervention β = kβ). This can be seen as a simplistic representation of enacting hospital wide control policies, such as cleaning, which reduce the probability that colonized patients encounter and infect a susceptible patient.

We can then calculate the effect of the intervention by measuring the difference in observations with and without the intervention.

We do this by comparing pair of trajectories that are identical up to a certain time point (we used 700 days) after which we progress one of these trajectories with the altered transmission rate (kβ) and the other with the unaltered rate (β). We then calculate the difference in total observed cases between the two trajectories, by which we ensure the difference is only due to the intervention. We repeat this exercise running 300 pairs and calculate the mean and variance of the differences. These results are used to estimate the cost-effectiveness of an intervention. By calculating this for different possible parameter sets we examine whether parameter differences produce a difference in the measurable effect of the intervention.

3. Results

Our model system leverages hospital records from the New York Presbyterian hospital (NYPH) system (n = 278,522 patients, 4 hospitals). For each patient the data indicates daily location, i.e., ward, within the hospital. In addition, a small fraction of the patients (1845, about 0.7%) have culture records that were positive for MRSA. This number is below the expected number of patients colonized with MRSA (about 5% in the hospital population based on prior studies [40]). To understand the unobserved progression of MRSA colonization in the hospital we constructed a mechanistic model that describes the importation of MRSA to the hospital by colonized individuals and nosocomial transmission between patients in the same ward during the same day (see Fig 1 in methods). Dynamic progression of the model is dependent on four free parameters; β –the nosocomial transmission rate γ –the importation probability α –the decolonization rate and ρ –the daily probability a colonized patient is detected. Using these parameters, we can simulate the progression of colonization among patients and the corresponding number of positive cultures we would expect to observe, which can be compared to actual observations. Specifically, trajectories are defined as the aggregation of the number of observed positive MRSA cultures in each two-week period over the span of 1399 days for which we have data. Due to model stochasticity, each simulation of the model produces a different trajectory, even if run with the same parameters and initial conditions. Therefore, we look for the set of parameters for which the probability of simulating the observed trajectory is highest.

We calibrated the agent-based model to the numbers of MRSA positive cultures observed every two weeks in NYPH. To simplify the analysis, we will assume knowledge of two parameters, α and ρ using published empirical studies and only try to infer β and γ. For each simulation, we started by drawing values of β and γ from independent uniform distributions over a wide range of possible values (0–0.1 [1/day]]) and drew the colonization status of each patient already admitted on the first day of simulation with probability to be colonized γ. We progressed the colonization status of the patients in the system each day for a period of 14 days and summed the number of positive cultures for that period. An ensemble of 300 simulations was integrated. We used the EAKF (see methods) to update the values of β and γ for each ensemble member based on the covariability of these parameters with the observed state variable (simulated number of positive MRSA cultures). We repeated this process over successive 14-day periods to obtain, theoretically, a distribution of parameters with an improved fit to observations. Once the simulation and updating was completed, we averaged the values of parameters β and γ for each ensemble member across the entire time series. This posterior ensemble was then used as the prior for another iterative assimilation of the same time series [41].

Fig 2 shows how the posterior distributions for γ and β change over 20 iterations. Parameter values stabilize around a solution of β = 0.015 and γ = 0.043.

thumbnail
Fig 2.

Inference results: Panels A and B) Solid lines with stars show the mean value of the posterior distribution of β and γ respectively, at the end of each itteration of the EAKF. Each color (red, orange and green) represents a different realization of the inference process. The shaded area represents the ensemble 90% confidence interval around the mean of the parameter values across all time steps. The inference was made using the hospital record observations. Dashed lines indicate the values at the final iteration. Values stabilize around β = 0.015 and γ = 0.043 (where the values for α = 1.5/365 and ρ = 0.016 were assumed). Panel C) Synthetic trajectories simulated using the inferred parameters. The black line is the observed trajectory taken from hospital records and used to perform the inference; the red and blue lines are two synthetic trajectories chosen at random from 300 simulations. The shaded area shows the 90% confidence interval around the mean of the number of observations at each time step for all 300 synthetic trajectories.

https://doi.org/10.1371/journal.pone.0290821.g002

The quality of these estimates was further tested by using the inferred parameters in simulation. Due to the stochastic nature of the model system, the simulated trajectories vary even when using the same parameters. The distribution of trajectories generated from the inferred parameters (Fig 2C) is centered around the observed trajectory which is mostly within the 90% confidence interval of simulations. By its nature, the EAKF inference will gravitate towards parameter values which are more likely to produce agreement with the observations; however, this process might fail to identify the unique result with the highest likelihood. For a high-dimensional, agent-based system, the likelihood landscape may have a complex structure with multiple local maxima. The stochastic noise associated with the dynamics can also obscure the landscape. It is therefore crucial to develop a method to discern whether the optimal solution has been identified.

Calculating the likelihood of all possible parameter combinations at sufficient density is not feasible for many models; however, for our model there are only two free parameters and scanning over this parameter space is possible. Therefore, we performed a likelihood analysis over a large surface of parameters (Fig 3A).

thumbnail
Fig 3.

Likelihood landscape: A) Heat map of the Log-likelihood of the observed trajectory, Pr(Tobserved|γ,β), across the parameter landscape (see S1 File) derived from an ensemble of 300 simulations. Each simulation is a realization of the model using the values of γ and β indicated on the x and y axis respectively. Dashed white line represents a weighted linear fit of β as a function of γ, where the weight of each coordinate is given by its likelihood. The resulting line equation is indicated above the line in white. Panel B) Likelihood along the ridge. Blue line shows likelihood score of observing the real trajectory given parameters along the ridge. For γ>0.025 the likelihood flattens. The likelihood landscape is characterized by a “ridge” along which the slope is very flat compared to the area perpendicular to the ridge. Red lines show the distribution (middle line is the mean; error bars show 90% confidence interval) of the likelihood, P(Ti,γ,β|γ,β), calculated for each of 200 synthetic trajectories Ti,γ,β generated with the same parameters (i.e., the expected likelihood for stochastic trajectories generated using these parameters). For small values of γ, the blue line falls below the CI, indicating that the observed trajectory is unlikely compared to the trajectories simulated with those parameters.

https://doi.org/10.1371/journal.pone.0290821.g003

Based on the likelihood landscape, the majority of parameter combinations in the parameter space can be excluded due to low likelihood. However, the landscape is characterized by a narrow ridge where likelihood is similarly high. This ridge can be approximated by a line equation β = 0.0623–1.079γ. The likelihood along the ridge is almost constant for γ>0.03 (Fig 3B), indicating a lack of distinction among parameter combinations. The fact that these parameters fall on a line implies an interplay between two processes: a reduction of importation rate combined with an appropriate increase of nosocomial transmission rate produces simulated trajectories that are similarly and indistinguishably well-fit to observations. This finding indicates that the system parameters are not fully identifiable.

While different points along the ridge represent scenarios where the expected observations are indistinguishable, the underlying realities they represent might be significantly different. Fig 4 shows three important metrics for four different parameter combinations along the ridge. As expected, the number of observations is very similar, however, the composition of the source of colonization varies from all colonizations imported to almost 30% of colonizations resulting from nosocomial transmission (Fig 4C). For a public health expert attempting to develop targeted interventions, these two scenarios, which equally account for the observations, could point to different conclusions. Even the total number of underlying colonizations, which might be expected to be proportional to the number of observations, changes by 80% along the ridge (Fig 4B). This variation manifests because patients who are hospitalized for a short time are just as likely to be colonized upon admission but are less likely to become infected and be observed during a shorter hospital stay.

thumbnail
Fig 4.

Different realities: Simulations using 4 parameter combinations that reside on the ridge of equal likelihood described in Fig 3: Blue—β = 0.029, γ = 0.03; orange—β = 0.02, γ = 0.04; green—β = 0.01, γ = 0.05; red—β = 10−3, γ = 0.06; For each parameter combination simulation was repeated 300 times. Colored solid lines and shaded areas represent the mean and 90% CI of the variable trajectories. Panel A) shows the number of observed positive cultures. The black line shows the number of positive results for the hospital data. All 4 parameter combinations produce an indistinguishable prediction of this variable. Panel B) shows the average number of total colonizations. Even though the daily observation ratio is the same (ρ = 0.016) and the number of observations is the similar, the underlying number of colonized patients decreases with increasing β. This illustrates that when more colonizations are due to nosocomial transmission they are more likely to be observed. Panal C) shows the average ratio of new cases between importation and nosocomial infections for the ensemble shown in panel A. This ratio also decreases with increasing β.

https://doi.org/10.1371/journal.pone.0290821.g004

The misrepresentation of the true system state is not academic. As can be seen in Fig 5, simulations using different but equally likely parameter combinations can result in very different predictions of intervention effectiveness. In this example, we tested the effect of an intervention that would reduce the nosocomial transmission rate. Such intervention is likely to involve significant investment of resources and public health decision makers might choose to use models to estimate the impact of this intervention on the number of cases. It can be seen that using different parameters which are equally likely to explain observations yield very different predictions of the efficacy of the intervention. This example is, of course, an arbitrary choice of intervention, and is only meant to show that even if observations are not affected by the choice of parameters, these parameters might cause the system to deviate substantially once the system is perturbed.

thumbnail
Fig 5. Different effects of intervention: We simulate an intervention that decreases the nosocomial transmission rate by a factor of 70% (β→0.3β) starting from January 2011.

Parameter sets are on the ridge of equal likelihood and are the same as used in Fig 4. Panel A) shows the mean trajectories of positive tests with (dashed lines) and without (solid lines) intervention. The trajectories without intervention are indistinguishable, but the effect of the intervention varies significantly. Panel B) shows the difference between the total number of positive cases with and without intervention for each of the parameter sets. The difference was calculated from 300 simulations with each parameter combination; circles and error bars show the mean and 90% CI.

https://doi.org/10.1371/journal.pone.0290821.g005

In more complicated systems with more than two parameters, such parameter space screening is impossible. However, we can learn more about the identifiability of our model-inference system by applying it to a relatively small number of synthetic trajectories. First, we applied the inference method to trajectories simulated with the parameters inferred from the actual culture records. If inference with these synthetic trajectories consistently were to yield the correct parameters, it would strengthen confidence that the system is identifiable because it would show, provided the estimated parameters represent reality accurately, the system can reliably retrieve them using our inference algorithm. However, for our system, as shown in Fig 6, inference using 3 different trajectories generated with the same parameters results in different parameter estimates. Despite this variance, all results still fall along the ridge shown in Fig 3.

thumbnail
Fig 6.

Inference over different trajectories from same parameters: Panel A) Using the parameters inferred with the real observations (β = 0.015 and γ = 0.043) we simulated 3 trajectories, T1(blue), T2(orange) and T3(green). The black line is the observed trajectory. Panel B) the posterior distribution of each trajectory in panel A produces 3 different solutions: θ1 = [0.040, 0.017] (blue circle), θ2 = [0.048, 0.015] (red diamond) and, θ3 = [0.043, 0.099] (green triangle) corresponding to T1, T2 and T3 respectively. Black star indicates the parameter values used to simulate trajectories T1, T2 and T3. Panel C) Likelihood analysis. Each point represents the Likelihood P(Ti|θj) where a blue circle signifies i = 1, orange diamond—i = 2 and green triangle represents i = 3. Lines represents the distribution of where is one of 300 synthetic trajectories that were simulated using the parameter set θj. Black stars indicate the likelihood of the observed trajectory Pr(Tobserved|θj). Each trajectory is almost equally likely to be simulated from any of the inferred parameters.

https://doi.org/10.1371/journal.pone.0290821.g006

In a perfect world, additional and more targeted testing can be performed to help differentiate different model parameters; however, many modeled systems are supported by limited data. An informed segmentation of such limited observations can help improve system identifiability by delineating where differences arise between equally likely parameter sets. In our system, it is reasonable to assume that if colonizations are primarily caused by nosocomial transmission (larger β, smaller γ), fewer positive results would appear during the first days in hospital, before a patient has contacted potentially infectious peers. Therefore, it may be possible to better distinguish parameters by fitting two time series of positive results: for the first few days following admission and later during the patient stay. In Fig 7 we see how this segmentation enables inference of a substantial reduction of the length of the ridge of equally likely parameters. It must be stated that the results shown in Fig 7 is one example. The success of this method changes dramatically for different parameters and even for different synthetic trajectories of the same parameter set. Further examination, calibration and improvement of this approach are needed and are the subject of ongoing work.

thumbnail
Fig 7.

Panels A-D) Heat map of the Log-likelihood of the observed trajectory, ℒ(T) = Pr(T|γ,β), across the parameter landscape (see S1 File) derived from an ensemble of 200 simulations, similar to Fig 3A. Unlike Fig 3, the likelihood was calculated for a synthetic trajectory with known parameters γ = 0.043 and β = 0.015 (marked as the white circle in each panel) A black square represents the maximum likelihood estimate. Panel A) shows the likelihood to observe the trajectory of total number of positive tests Ttot, as in Fig 3. The likelihood landscape is characterized by a long and narrow ridge of similar likelihood. Panel B) shows the likelihood of observation of the positive tests that were performed only within the first two days of hospitalization, TD2. This likelihood surface is almost flat across different values of beta while showing a relatively sharp peak in γ. This is to be expected as observations in the first two days are almost completely attributable to imported cases. Panel C) shows likelihood of observations of positive tests that were preformed later than the second day of admission, TA2. The likelihood landscape appears similar to that of total number of observations. Panel D) shows that joint probability to observe both trajectories TD2 and TA2 simultaneously. It can be seen that the likelihood landscape exhibits a much sharper slope around the peak which is at the correct parameter coordinates. For better visualization of the peaks and plateaus mentioned above, Panel E and F) shows the likelihood profiles for β (max{ℒ(T)}β) and γ (max{ℒ(T)}β), respectively, where T is Ttot (gray solid line), TD2 (dotted green line), TA2 (dash-dotted red line) or the combination of TD2 and TA2 (dashed blue line).

https://doi.org/10.1371/journal.pone.0290821.g007

4. Discussion

In this work we have explored the identifiability for model parameters in a stochastic agent-based model. We have shown that the model-inference system produced consistent parameter convergence and good agreement with observations by simulations using those parameters. However, the inference system cannot fully identify the unique parameters used for generating synthetic outbreaks. Instead, the estimated parameters tend to lie on a ridge of parameter space with near equal likelihood. Our analysis indicates that the inference system can only identify a region in the parameter space that may include the true parameters with some level of uncertainty.

The identifiability of a system can be affected by four issues. The first is the identifiability of the model, which can be influenced by the complexity of the system (i.e., the number of parameters and interplay between them). Our findings can be reasonably interpreted as an example of this issue. The linear dependance between the inferred parameters along the ridge indicates that an interplay between the parameters may be the cause of unidentifiability. We used a simple model with only two free parameters to illustrate this, but this finding is likely not unique to simple systems and may be exacerbated with additional degrees of freedom. A second aspect affecting identifiability is the degree of stochasticity in the system. In stochastic systems an ambiguity between parameters and observations is inherent. If the spread of observations for simulations generated with different parameters overlap considerably, identifiability might become impossible.

The third component affecting identifiability is the richness of data. The quantity and structure of observations are key for distinguishing, within a given model, between different parameters. If data are sparse or possess too little signal, they may be explained by very different realizations equivalently well. In our example, increasing the quantity of observations improves the probability of finding the correct parameters; however, the system still may not find the unique model parameters due to interplay between parameters and stochasticity (see S1 File, where the likelihood ridge remains even when the probability of observation, ρ, is increased tenfold). The fourth issue is the choice of inference algorithm. There exist many inference algorithms, some of which might be more or less suitable for different systems, and within each algorithm there are often different choices that can be made to improve the performance of the inference (e.g., informed selection of priors, controlled convergence, etc.). However, in systems like the one presented here, a different algorithm would not have produced better parameter inference because there is no theoretical way to distinguish between different parameters given the model, system stochasticity, and data available.

While the quantity of data available for modeling is usually fixed, identifiability might be improved by manipulating the structure of the data to create differentiating observables. As we show in Fig 7, an improvement in identifiability can be achieved by segmenting observation according to length of stay. Specifically, instead of fitting the model to one time series of observed incidences among all patients, we fitted the model to two time series representing the observed incidences within and after 2 days of patients’ admission. The area in the parameter space with a high likelihood was reduced by rearranging the observation data, indicating an improved parameter identifiability. This is an example of what can be achieved by an informed segmentation of observations. Other possible segmentations (for example, by ward) and calibration of the likelihood can further improve identifiability. Another option is to calculate indicators of the spread of observations within the population. In our example, we might calculate the mixing entropy of the distribution of observations between wards (entropy is a common measure of the randomness of dispersion). This type of observation might improve identifiability if there is more nosocomial infection than importation, in which case we would expect observations to be more clustered. A more definitive examination of these data manipulation approaches is an expected part of future work.

The level to which a modeler needs to distinguish between different parameters is dependent on the questions being addressed. For example, in our system we are unable to identify the unique true parameters that give the correct fraction of nosocomial transmission; however, if we are simply interested in the qualitative question of the role of importation in MRSA incidence, the results for the synthetic outbreak in Fig 4C give a consistent answer that the majority of (>50%) cases are imported. Therefore, a useful definition of identifiability might be the ability to exclude enough of parameter space (i.e. reduce uncertainty in estimated parameters) such that the quantity in which we are interested is sufficiently constrained.

Supporting information

S1 File. Pseudo-code.

Pseudo-codes for algorithms used for computation in this paper.

https://doi.org/10.1371/journal.pone.0290821.s001

(DOCX)

References

  1. 1. Deboscker S, Séverac F, Gaudart J, Ménard C, Meyer N, Lavigne T. An agent-based model to simulate the transmission of vancomycin-resistant enterococci according different prevention and control measures. Infection Control & Hospital Epidemiology. 2021;42: 857–863. pmid:33336639
  2. 2. Pei S, Liljeros F, Shaman J. Identifying asymptomatic spreaders of antimicrobial-resistant pathogens in hospital settings. Proceedings of the National Academy of Sciences. 2021;118: e2111190118. pmid:34493678
  3. 3. Perkins TA, Jr RCR, España G, Bosch QA ten, Verma A, Liebman KA, et al. An agent-based model of dengue virus transmission shows how uncertainty about breakthrough infections influences vaccination impact projections. PLOS Computational Biology. 2019;15: e1006710. pmid:30893294
  4. 4. España G, Grefenstette J, Perkins A, Torres C, Campo Carey A, Diaz H, et al. Exploring scenarios of chikungunya mitigation with a data-driven agent-based model of the 2014–2016 outbreak in Colombia. Sci Rep. 2018;8: 12201. pmid:30111778
  5. 5. Grundmann H, Hellriegel B. Mathematical modelling: a tool for hospital infection control. The Lancet Infectious Diseases. 2006;6: 39–45. pmid:16377533
  6. 6. van Kleef E, Robotham JV, Jit M, Deeny SR, Edmunds WJ. Modelling the transmission of healthcare associated infections: a systematic review. BMC Infectious Diseases. 2013;13: 294. pmid:23809195
  7. 7. Slayton RB, Toth D, Lee BY, Tanner W, Bartsch SM, Khader K, et al. Vital Signs: Estimated Effects of a Coordinated Approach for Action to Reduce Antibiotic-Resistant Infections in Health Care Facilities—United States. MMWR Morb Mortal Wkly Rep. 2015;64: 826–831. pmid:26247436
  8. 8. Paul P, Slayton RB, Kallen AJ, Walters MS, Jernigan JA. Modeling Regional Transmission and Containment of a Healthcare-associated Multidrug-resistant Organism. Clinical Infectious Diseases. 2020;70: 388–394. pmid:30919885
  9. 9. Toth DJA, Khader K, Slayton RB, Kallen AJ, Gundlapalli AV, O’Hagan JJ, et al. The Potential for Interventions in a Long-term Acute Care Hospital to Reduce Transmission of Carbapenem-Resistant Enterobacteriaceae in Affiliated Healthcare Facilities. Clinical Infectious Diseases. 2017;65: 581–587. pmid:28472233
  10. 10. Smith DL, Levin SA, Laxminarayan R. Strategic interactions in multi-institutional epidemics of antibiotic resistance. Proceedings of the National Academy of Sciences. 2005;102: 3153–3158. pmid:15677330
  11. 11. Rubin MA, Jones M, Leecaster M, Khader K, Ray W, Huttner A, et al. A Simulation-Based Assessment of Strategies to Control Clostridium Difficile Transmission and Infection. PLOS ONE. 2013;8: e80671. pmid:24278304
  12. 12. Worby CJ, Jeyaratnam D, Robotham JV, Kypraios T, O’Neill PD, De Angelis D, et al. Estimating the Effectiveness of Isolation and Decolonization Measures in Reducing Transmission of Methicillin-resistant Staphylococcus aureus in Hospital General Wards. American Journal of Epidemiology. 2013;177: 1306–1313. pmid:23592544
  13. 13. Cooper BS, Medley GF, Stone SP, Kibbler CC, Cookson BD, Roberts JA, et al. Methicillin-resistant Staphylococcus aureus in hospitals and the community: Stealth dynamics and control catastrophes. Proceedings of the National Academy of Sciences. 2004;101: 10223–10228. pmid:15220470
  14. 14. Bootsma MCJ, Diekmann O, Bonten MJM. Controlling methicillin-resistant Staphylococcus aureus: Quantifying the effects of interventions and rapid diagnostic testing. Proceedings of the National Academy of Sciences. 2006;103: 5620–5625. pmid:16565219
  15. 15. Wangdi K, Singhasivanon P, Silawan T, Lawpoolsri S, White NJ, Kaewkungwal J. Development of temporal modelling for forecasting and prediction of malaria infections using time-series and ARIMAX analyses: A case study in endemic districts of Bhutan. Malaria Journal. 2010;9: 251. pmid:20813066
  16. 16. Jones SS, Evans RS, Allen TL, Thomas A, Haug PJ, Welch SJ, et al. A multivariate time series approach to modeling and forecasting demand in the emergency department. Journal of Biomedical Informatics. 2009;42: 123–139. pmid:18571990
  17. 17. Pei S, Morone F, Liljeros F, Makse H, Shaman JL. Inference and control of the nosocomial transmission of methicillin-resistant Staphylococcus aureus. Cooper B, Jha P, editors. eLife. 2018;7: e40977. pmid:30560786
  18. 18. Cooper BS, Medley GF, Bradley SJ, Scott GM. An Augmented Data Method for the Analysis of Nosocomial Infection Data. American Journal of Epidemiology. 2008;168: 548–557. pmid:18635575
  19. 19. Thomas A, Redd A, Khader K, Leecaster M, Greene T, Samore M. Efficient parameter estimation for models of healthcare-associated pathogen transmission in discrete and continuous time. Mathematical Medicine and Biology: A Journal of the IMA. 2015;32: 81–100. pmid:24114068
  20. 20. Thomas A, Khader K, Redd A, Leecaster M, Zhang Y, Jones M, et al. Extended models for nosocomial infection: parameter estimation and model selection. Mathematical Medicine and Biology: A Journal of the IMA. 2018;35: i29–i49. pmid:29040678
  21. 21. Stano P, Lendek Z, Braaksma J, Babuška R, de Keizer C, den Dekker AJ. Parametric Bayesian Filters for Nonlinear Stochastic Dynamical Systems: A Survey. IEEE Transactions on Cybernetics. 2013;43: 1607–1624. pmid:23757593
  22. 22. Wang S, Yang X, Li L, Nadler P, Arcucci R, Huang Y, et al. A Bayesian Updating Scheme for Pandemics: Estimating the Infection Dynamics of COVID-19. IEEE Computational Intelligence Magazine. 2020;15: 23–33.
  23. 23. Calvetti D, Hoover A, Rose J, Somersalo E. Bayesian particle filter algorithm for learning epidemic dynamics. Inverse Problems. 2021;37: 115008.
  24. 24. Chen X, Lee JD, Tong XT, Zhang Y. Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics. 2020;48: 251–273.
  25. 25. Wang Z, Zhang X, Teichert GH, Carrasco-Teja M, Garikipati K. System inference for the spatio-temporal evolution of infectious diseases: Michigan in the time of COVID-19. Comput Mech. 2020;66: 1153–1176. pmid:35194281
  26. 26. Gallo L, Frasca M, Latora V, Russo G. Lack of practical identifiability may hamper reliable predictions in COVID-19 epidemic models. Science Advances. 2022;8: eabg5234. pmid:35044820
  27. 27. Sauer T, Berry T, Ebeigbe D, Norton MM, Whalen AJ, Schiff SJ. Identifiability of Infection Model Parameters Early in an Epidemic. SIAM J Control Optim. 2022;60: S27–S48. pmid:36338855
  28. 28. Anstett-Collin F, Denis-Vidal L, Millérioux G. A priori identifiability: An overview on definitions and approaches. Annual Reviews in Control. 2020;50: 139–149.
  29. 29. Walter E, Lecourtier Y. Global approaches to identifiability testing for linear and nonlinear state space models. Mathematics and Computers in Simulation. 1982;24: 472–482.
  30. 30. Browning AP, Warne DJ, Burrage K, Baker RE, Simpson MJ. Identifiability analysis for stochastic differential equation models in systems biology. Journal of The Royal Society Interface. 2020;17: 20200652. pmid:33323054
  31. 31. Heppenstall A, Crooks A, Malleson N, Manley E, Ge J, Batty M. Future Developments in Geographical Agent-Based Models: Challenges and Opportunities. Geographical Analysis. 2021;53: 76–91. pmid:33678813
  32. 32. Castro J, Drews S, Exadaktylos F, Foramitti J, Klein F, Konc T, et al. A review of agent-based modeling of climate-energy policy. WIREs Climate Change. 2020;11: e647.
  33. 33. Cuevas E. An agent-based model to evaluate the COVID-19 transmission risks in facilities. Computers in Biology and Medicine. 2020;121: 103827. pmid:32568667
  34. 34. Hoertel N, Blachier M, Blanco C, Olfson M, Massetti M, Rico MS, et al. A stochastic agent-based model of the SARS-CoV-2 epidemic in France. Nat Med. 2020;26: 1417–1421. pmid:32665655
  35. 35. Mls K, Kořínek M, Štekerová K, Tučník P, Bureš V, Čech P, et al. Agent-based models of human response to natural hazards: systematic review of tsunami evacuation. Nat Hazards. 2023;115: 1887–1908. pmid:36212893
  36. 36. Bolton ER, Berglund EZ. Agent-based modeling to assess decentralized water systems: Micro-trading rainwater for aquifer recharge. Journal of Hydrology. 2023;618: 129151.
  37. 37. Pleyer J, Fleck C. Agent-based models in cellular systems. Frontiers in Physics. 2023;10. Available: https://www.frontiersin.org/articles/10.3389/fphy.2022.968409
  38. 38. Villaverde AF, Pathirana D, Fröhlich F, Hasenauer J, Banga JR. A protocol for dynamic model calibration. Briefings in Bioinformatics. 2022;23: bbab387. pmid:34619769
  39. 39. Wieland F-G, Hauber AL, Rosenblatt M, Tönsing C, Timmer J. On structural and practical identifiability. Current Opinion in Systems Biology. 2021;25: 60–69.
  40. 40. General Information | MRSA | CDC. 26 Jun 2019 [cited 18 Jun 2022]. Available: https://www.cdc.gov/mrsa/community/index.html
  41. 41. Ionides EL, Bretó C, King AA. Inference for nonlinear dynamical systems. Proceedings of the National Academy of Sciences. 2006;103: 18438–18443. pmid:17121996