Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Modeling latent spatio-temporal disease incidence using penalized composite link models

Abstract

Epidemiological data are frequently recorded at coarse spatio-temporal resolutions to protect confidential information or to summarize it in a compact manner. However, the detailed patterns followed by the source data, which may be of interest to researchers and public health officials, are overlooked. We propose to use the penalized composite link model (Eilers PCH (2007)), combined with spatio-temporal P-splines methodology (Lee D.-J., Durban M (2011)) to estimate the underlying trend within data that have been aggregated not only in space, but also in time. Model estimation is carried out within a generalized linear mixed model framework, and sophisticated algorithms are used to speed up computations that otherwise would be unfeasible. The model is then used to analyze data obtained during the largest outbreak of Q-fever in the Netherlands.

Introduction

In recent decades, the development of spatio-temporal statistical methods in Epidemiology, the cornerstone of Public Health, has been remarkable. Such development has been mainly driven by advances in geographic information systems (GISs), access to reliable health data registers, and the availability of powerful software capabilities to process and analyze large amounts of data. The methodological contributions to the analysis of spatio-temporal health data come from several interdisciplinary researchers, whose backgrounds are mostly related to Statistics, Geography, Environmental Sciences, and Epidemiology.

Within the diversity of of epidemiological research, disease mapping has attracted much interest in Public Health as it helps to visualize disease incidence of mortality risk patterns in a specific area. To this end, appropriate statistical methods have been applied to health data, which are usually recorded per spatial units, to provide smoothed disease incidences per unit. Smoothing is performed to obtain more stable and less noisy estimates of the incidence rates associated with each unit [1], which helps to determine meaningful spatial patterns. By adding the temporal dimension to this context, it is possible to examine the evolution of disease incidence in each unit, during a certain period of time (generally divided in years), but it implies a challenge for smoothing data, in terms of computational time and storage, even more if disaggregation is also needed in time. Several techniques have been proposed for the spatio-temporal smoothing of health data; most of which are developed under an empirical Bayes approach, where B-splines are used [2, 3] or a hierarchical Bayesian framework where conditional autoregressive (CAR) structures are included [46]. Regarding this last approach, methods that use integrated nested Laplace approximations (INLA, [7]) have recently been proposed (see [810]; among others).

All the works mentioned above provide smoothed estimates that change over time within a unit and across units within time interval, although each estimated value is assigned to the whole unit. Furthermore, most of them can be extended to include covariates or explanatory variables, which must have the same spatio-temporal resolution as health data. Therefore, they restrict the incorporation of population information at fine-scale and other relevant risk factors recorded at a finer resolution (see the modifiable areal unit problem; [11, Ch. 29]). The challenge here is not only that the map is dynamic, but the fact that, in order to obtain a better insight of the evolution of the incidence, detailed maps are needed at a finer spatial and temporal scale than the one provided (for example, grid instead of areas and weeks instead of months) while maintaining the coherence with the aggregated observed counts. To overcome the limitations of previous works, we propose the use of the spatio-temporal penalized composite link model(ST-PCLM) to estimate the latent distribution of spatio-temporally grouped count data. Our proposal is based on the penalized composite link model of [12], combined with a spatio-temporal P-spline methodology [13], to obtain smoothed estimates at a finer resolution from data aggregated over space and/or time. Moreover, the approach allows to include population information at fine-scale and specific random effects or further correlation structure, if necessary. This could be in the form of unstructured variation (including random effects per spatial unit) or structured variation by assuming a known variance-covariance matrix for the spatial random effects, for example using a CAR model as in [14].

As in any latent observation problem, there are infinitely possible solutions that fit the observed data. In that sense, the model is not identifiable, and the use of some prior knowledge or constraints (based on experience or common sense) is required for the problem to be well posed. Because we want to estimate a distribution, it is reasonable to expect a relatively smooth result, so we assume here that the underlying spatio-temporal process behind aggregated data is smooth and stationary (although this last hypothesis could also be relaxed by using multidimensional adaptive smoothing as in [15]). The flexibility of the model is given by the use of B-spline bases and a discrete penalty on the regression coefficients, following the P-spline methodology [16]. Smoothness is controlled by three smoothing parameters (two for the spatial dimension and one for the temporal dimension) that have to be estimated along with the overall trend at the fine scale. To perform the spatio-temporal disaggregation, the ST-PCLM uses a composition matrix that links both coarse and fine resolutions, and which is expressed as a Kronecker product of two marginal composition matrices (one acting at a spatial level and another at temporal level).

We can find several techniques that allow the spatial disaggregation of health data. For example, the use of an interpolation process from empirical Bayes estimates [17], (generalized) Poisson kriging methods [18], log-Gaussian Cox processes [19], and spatial composite link mixed models [20] have been suggested to produce a continuous smooth surface (across the study area) from regional health data. However, as far as we know, there is no appropriate model to address the problem of disaggregation of health data in both space and time (although there are some works about spatio-temporal disaggregation in a Gaussian context; see, for example, [2125]). A common among social scientist is geographical microsimulation (see for example [26]). Geographical microsimulation models simulate populations in given geographical areas, with population characteristics as close as possible to their real counterparts. But a drawback may be that a representative sample of individuals at the fine-scale data required by those methods is not always available. The methodology presented here allows to create detailed dynamic maps for disease incidence data. Publicly available maps in aggregated form over space and time are the only information required, as well as geographical locations and time points where a finer resolution prediction is required.

Materials and methods

Q fever data

Q fever is a widespread zoonotic disease caused by the bacterium Coxiella burnetii. C. burnetii transmission to humans is mainly associated with ruminants such as cattle, sheep, and goats. During parturition or abortion of infected animals, high numbers of C. burnetii are shed within the amniotic fluids and the placenta. These organisms end up in the environment, where they may survive for long periods due to their resistance to heat, dryness, and many common disinfectants. Humans are often highly susceptible to the disease, and very few organisms may be necessary to cause infection. More information about this infectious disease is provided in [27].

The Southern Netherlands faced large outbreaks of human Q fever from 2007 to 2010 [28]. In this country, the local municipal health services (MHSs) are responsible for registering all confirmed diagnoses of acute Q fever. The information collected is then entered into the electronic national infectious diseases surveillance database. Due to confidentiality, these data are not publicly available and, in some instances, maybe provided in an aggregated form. In this case, the data were made available monthly at the municipality level.

Fig 1 shows the temporal distribution of Q fever cases (in months) from January 2007 to July 2010. A total of 3806 acute Q fever cases were registered in this period: 192 in 2007, 980 in 2008, 2309 in 2009, and 325 in 2010. The epidemic peaks of each year were observed every spring, specifically during May. This coincides with the birth period of small ruminants (sheep and goats), a fact that was pointed out in several studies about those exceptionally large Q fever outbreaks in the Netherlands (see, for example, [28, 29]). Since the largest outbreak was observed during 2009, we have studied the distribution of Q fever incidence in that year.

thumbnail
Fig 1. Human Q fever cases in the Netherlands grouped per months, from January 2007 to July 2010.

https://doi.org/10.1371/journal.pone.0263711.g001

Fig 2a shows postal code areas affected by human Q fever (red points) in 2009. However, the number of cases was only publicly available at the municipality level. Since dairy goat and sheep farms are not evenly distributed across the Netherlands, Q-fever infection did not occur in all areas of the country. In this study we have focused on a 60 × 60km area in the south of the Netherlands (see black square in Fig 2a) with a total of 72 municipalities. The total number of Q fever cases reported in these municipalities was 1798. Taking into account the number of inhabitants in each municipality, we can calculate Q fever incidence (per 100000 inhabitants). Fig 2b shows the spatial distribution of the resulting Q fever incidence (aggregated over months in 2009), with higher incidence values observed around the municipalities of Landerd (1439.676), Lith (562.546), and Heusden (295.006). Both Figs 1 and 2 show a smooth distribution of counts in space and time which makes our initial assumption of a smooth latent distribution plausible.

thumbnail
Fig 2. Map of human Q fever cases in the Netherlands, 2009.

Left: red points indicate the residential addresses of human cases (2309 in total). Right: study area in the south of the Netherlands showing (raw) incidence (per 100000 inhabitants) of Q fever by municipality in 2009. Source: geo-referenced data are obtained from https://www.cbs.nl/en-gb/onze-diensten/open-data/statline-as-open-data/cartography. a) Q fever outbreaks in 2009. b) Q fever incidence in study area.

https://doi.org/10.1371/journal.pone.0263711.g002

The spatio-temporal penalized composite link model

The Spatio-Temporal Penalized Composite Link Model (ST-PCLM) is the name of the model we propose for the spatio-temporal disaggregation of epidemiological data. ST-PCLM combines the penalized composite link model (PCLM) of [12] and spatio-temporal smoothing with P-splines [13]. The PCLM generalizes the model proposed by [30] and allows the estimation of a smooth trend, from grouped data, at a finer resolution. The idea behind it is as follows:

We observe data y from a Poisson distribution with mean μ, then we assume that μ is a composition of the ungrouped distribution γ. The approach assumes that the underlying (ungrouped) distribution is smooth, but otherwise let the data determine their shape.

As PCLM applies to the unidimentional scenario, we develop it in the spatio-temporal context. Then, we have included the ideas presented in [13] to reformulate the new spatio-temporal PCLM under a generalized linear mixed model (GLMM) framework, i.e., ST-PCLM. Hereafter, we will show the development of the ST-PCLM and a computationally efficient way to estimate the ST-PCLM parameters.

The model.

Let yit, i = 1, …, n, t = 1, …, Ta, denote count data recorded over n non-overlapping spatial units vi, which constitute the area of interest, at Ta time periods. Suppose that we want to estimate the latent distribution of the vector of counts at a spatio-temporal support that is a refinement of the original one. The fine support is determined by three coordinates: x1 = (x11, …, x1m)′ and x2 = (x21, …, x2m)′, with m > n, which represent the longitude and latitude coordinates of the spatial refinement, respectively; and , with Tf > Ta, which represents the coordinates of the temporal refinement. We assume that the spatial refinement remains fixed at each instant of time in x3, but the method can easily be extended to relax this assumption. Assuming that y is Poisson distributed with mean vector μ, the ST-PCLM is given by: (1) where γf denotes the fine-scale latent mean, Cst is the composition matrix that describes how these latent means are combined to yield μ, and fst(x1, x2, x3) represents the fine-scale spatio-temporal trend. Several approaches can be used to model that function, e.g.: Gaussian random fields, spatio-temporal kriging or penalized splines, among others. We propose the last option since smoothness might be a reasonable property to ask for: if we are estimating an unknown distribution, it seems natural to assume smoothness since we do not know much about the underlying process (the use of the other techniques mentioned is out of the scope of this paper and will be subject of further work). We assume that the non-separable function fst(x1, x2, x3) may be represented by the product of simpler functions , , and . Therefore, the basis representation of the function would be given by: (2) where B1 = B(x1), B2 = B(x2), and B3 = B(x3) are univariate B-spline bases [16] of dimensions m × c1, m × c2 and Tf × c3 respectively. The matrix operators ⊗ and □ on the right-hand side of Eq (2) represent Kronecker and Box products (row tensor products), respectively [31] (if the spatial refinement changes over time, Box product would be used instead of Kronecker product). To achieve smoothness, we use an anisotropic penalty (based on second order differences constructed separately for each independent variable) over the vector of regression coefficients θ. The form of this matrix is as follows: (3) where denotes an identity matrix of dimension cd × cd, λd is the smoothing parameter that controls the amount of smoothing along the coordinate xd, and is the marginal penalty matrix based on Dd, which computes qd-th differences, i.e., , for d = 1, 2, 3. As we mentioned earlier, the matrix in Eq (3) has an anisotropic property in the sense that allows a different amount of smoothing for each coordinate.

Since we are assuming that the spatial refinement is fixed over the temporal refinement, the composition matrix Cst in model (1) is obtained as: (4) where Cs and Ct are the spatial and temporal composition matrices of dimensions n × m and Ta × Tf, respectively. The structure of these matrices depends on the type and level of aggregation. For example, if we want to estimate the latent distribution at a fine spatial grid (over a study area), the entries of the spatial composition matrix will be: (5) where (x1j, x2j) are the cell centroid coordinates of the fine grid, for i = 1, …, n and j = 1, …, m. Another option to construct Cs is to consider its entries as the area proportions that each grid cell shares with a specific unit. The temporal composition matrix Ct is used to disaggregate coarse time intervals into detailed time periods (for example, from years or trimesters to months, weeks, or days). In our application, we show the structure that Ct will have, specifically for our purposes. Note that if Cs = In, , and the unit centroids are used as spatial coordinates, the presented methodology is reduced to the Poisson version of the proposal given by [13] for smoothing of spatio-temporal count data.

When spatial data are recorded over a coarse rectangular grid, an appropriate definition for Bst in Eq (2) is Bst = B3 ⊗ (B2B1), where the spatial refinement correspond to the cell centroid coordinates of a fine grid. The penalty matrix Pst in Eq (3) is still valid in this context, because its definition is independent of data structure. Moreover, the spatial composition matrix would be given by Cs = C2C1, where each Cd is constructed according to the disaggregation of the coarse grid cells into small ones, for d = 1, 2. Although the Kronecker structure in (4) is a computational advantage, our model can cope with more complex situations where the spatial disaggregation changes from one time point to another (for example, census tracks changes along the years) by simply changing the structure of Cst.

Mixed model representation.

Now, we show how to incorporate fine-scale population information into the model (1). To do so, we must reformulate the model as a Generalized Linear Mixed Model (GLMM) by following the proposal of [13] as it is briefly described below.

In the P-splines literature, [13] show a nice representation of the spatio-temporal trend as a mixed model. Following their approach, the term fst(x1, x2, x3) in model (2) can be rewritten as fst(x1, x2, x3) = Bst θ = x β + Z α. Thus, we have: (6) where X and Z are fixed and random effects matrices, and β and α are their associated coefficients, respectively. Random effects have covariance matrix G that depends on the smoothing parameters. The model includes an offset term log(ef) that allows the analysis of mortality or incidence rates instead of counts. The vector ef could be, for example, fine-scale population information or expected number of deaths. If we only have the offset term at the coarse scale, i.e., , a naive approach to estimate ef assumes that the elements of e are evenly distributed throughout the fine resolution. Therefore, we can compute these naive estimates as , where denotes the Moore-Penrose inverse of Cst.

The construction of the matrices X, Z, and G in model (6) are obtained by using the singular value decomposition (SVD) of each discrete penalty matrix Pd in Eq (3), for d = 1, 2, 3. The mixed model matrices are given by: (7) where Xd = Bd Udn and Zd = Bd Uds are constructed from the matrices of singular vectors corresponding to null and non-null singular values of the SVD of Pd, Udn and Uds, respectively, for d = 1, 2, 3. Denoting as the diagonal matrix with the non-null singular values of the SVD of Pd in the main diagonal, the inverse of the covariance matrix G becomes the block-diagonal matrix: (8) where:

If data are spatially recorded over a rectangular coarse grid, the corresponding mixed model matrices are obtained as in Eq (7), but replacing the Box products □ by Kronecker products ⊗. The formulation of G−1 remains the same.

Parameter estimation.

Once the ST-PCLM defined in (6) is in the GLMM framework, it is possible to estimate its parameters. This estimation procedure was presented by [20] in a spatial disaggregation context, and it involves two interrelated stages: (a) estimation of fixed coefficients and random effects (β and α); and (b) estimation of smoothing parameters (λ1, λ2, and λ3). The penalized quasi-likelihood (PQL) methods of [32] are used for stage (a), and the restricted (or residual) maximum likelihood (REML, [32, 33]) is used for stage (b) as a numerical optimization criterion for smoothing parameter selection. Technical details are provided in [20] and, thus, we only describe here the necessary results.

For given values of λ1, λ2, and λ3, the estimation of the fixed and random effects coefficients of the model (6) are obtained by maximizing the following approximate penalized log-likelihood: (9) where 1 denotes a vector of ones of length nTa and μ = Cst γf, with γf = ef * exp(X β + Z α). Differentiation of Eq (9) with respect to β and α leads to the score equations: (10) where and are called ‘working’ mixed model matrices, since W = diag(μ) and Γ = diag(γf) change during the estimation procedure. Defining the working vector as: the solution of the score equations in (10) via Fisher scoring algorithm is expressed as the iterative solution of the system: (11) where b = G−1 α. This yields to a modified version of the standard mixed model estimators: (12) (13) where: (15)

Conditioning on the estimates (12) and (13), the smoothing parameters λ1, λ2, and λ3 can be estimated by maximizing the approximate REML (see [32, Eq. 13]): (16) where λ1, λ2, and λ3 are involved in V through G. Therefore, optimal estimates for the ST-PCLM parameters are obtained by iteration between Eqs (12), (13) and (16), until convergence.

Once the ST-PCLM parameter estimates at convergence are obtained, we can derive standard errors for by using the Bayesian approximation of the variance-covariance matrix for (see [34] for further details). Thus, the approximate standard errors for are obtained by taking the square root of , which is obtained as:

We should note that the estimation procedure described above can be computationally inefficient if we are working with large datasets and the goal is to obtain estimates at a very fine scale. Furthermore, the direct creation of the matrices Csf, X, and Z in model (7) can easily lead to storage problems. In the next Sections, we will provide solutions for the ST-PCLM estimation in terms of storage and efficiency (these solutions were not available in [20]).

Fast algorithm for spatio-temporal penalized composite link models.

In order to efficiently compute estimates of the smoothing parameters, we have adapted the SAP (Separation of Anisotropic Penalties) algorithm of [35] to the ST-PCLM context. To avoid possible storage problems, we have used the so-called GLAM (Generalized Linear Array Methods, [31, 36]) in the ST-PCLM setting. These methods also provide an efficient way to compute the matrix of cross-products required for the SAP algorithm, easing the computation time of model estimation.

Under the ST-PCLM approach and conditioning on the estimates given in Eqs (12) and (13), we can estimate λ1, λ2, and λ3 by numerical maximization of the approximate REML in Eq (16). The usual algorithms used to approximate the solutions, such as the one proposed by [37] (extended to the generalized case by [38]), can only deal with situations in which the variance-covariance matrix is linear on the variance parameters. In the case of spatio-temporal models with anisotropic penalties, regression parameters are affected by more than one smoothing parameter, and so standard approaches can’t be used, since the corresponding variance-covariance matrix does not have the required form. [35] proposed the SAP algorithm which provides a numerical solution to REML estimates, but it is able to deal with models that have a precision matrix for the random effect vector that is linear in the inverse of the variance parameters. These precision matrices are common when penalized smooth models with anisotropic penalties are reformulated as (generalized) linear mixed models, where the smoothing parameters are seen as ratios of variance components, i.e., , for d = 1, 2, 3. Since we are working under a Poisson framework, the dispersion parameter, ϕ, is equal to 1. Thus, the problem is reduced to obtain estimates for the variance components , , and . The reformulation of the SAP algorithm for the ST-PCLM is described below.

Following [35] we can derive closed-form expressions, from the approximate REML, for the variance components , for d = 1, 2, 3. These estimates are given by: (17) where: (18) with N defined in Eq (15), and The non-null submatrices of each Λd were previously defined in Section. Here the inverse of the covariance matrix G in model (6) is written in terms of ’s and can be decomposed as , where the capital lambdas are defined above. The SAP algorithm for the ST-PCLM parameter estimation is given in Algorithm 1, which is an adaptation of the algorithm provided in [35, p. 945].

Algorithm 1 SAP algorithm for the ST-PCLM parameters estimation

Require: Convergence tolerances ν1 and ν2 (e.g., 1 × 10−6) and maximum number of iterations maxit1 and maxit2 (e.g., 100).

1: Set initial values for the mixed model coefficients β and α, and the variance components , , and (for example, with length q1 q2 q3, with length (c1 c2 c3q1 q2 q3), and ). Set k = 0.

2: for 1 to maxit1 do

3:  Given the current estimates for the mixed model coefficients, construct the matrix of weights W and the working vector z as follows: with .

4:  for 1 to maxit2 do

5:   Given the current estimates for the variance components, obtain new estimates for β and α by solving the system in Eq (11). The resulting estimates are denoted as and , respectively.

6:   Obtain new estimates for the variance components using Eq (17). The resulting estimates are denoted as , for d = 1, 2, 3.

7:   Compare new variance component estimates with the previous ones, using the following convergence criterion:

8:   If the convergence tolerance is achieved, break, otherwise set and repeat steps 5, 6, and 7 until convergence.

9:  end for

10:  Compute a new estimate for the fine-scale smooth trend vector using the fixed and random effects estimates obtained in the last iteration of step 5. The resulting vector is denoted as . Compare the new estimate with the previous one, using the following convergence criterion:

11:  If the convergence tolerance is achieved, break, otherwise set k = k + 1 and repeat steps 2 to 11 until convergence.

12: end for

Although we have suggested setting the maximum number of iterations to 100, our experience is that the number of iterations needed to achieve convergence is much smaller. In general, we have observed that the maximum number of iterations to obtain optimal variance components is greater than the number of iterations to obtain optimal estimates for the linear predictor. In the analysis of Q-fever data, convergence was reached after 30 iterations.

We should note that we can efficiently compute the trace in Eq (18) by taking into account that G Λd G is a diagonal matrix. Thus, we only have to compute the diagonal of to obtain this trace. From [39, Eq. (5.3)], we have that: Therefore, the diagonal elements of matrix are obtained by the column-wise addition of:

An advantage of using the adapted SAP algorithm provided above is that we can directly compute the effective dimension (ED) of model (6). This model complexity measure is given by: (19) where each edd is computed from Eq (18). The first term on the right-hand side of Eq (19) corresponds to the dimension of the unpenalized (or fixed) part, whereas the second corresponds to the unpenalized (or random) part of the fitted model. To visualize the later statement, note that: where is the ‘hat matrix’ [40] of the unpenalized part of the fitted ST-PCLM (see Eq (13)).

Note that the matrix cross-products , , and (and its transpose) are involved in the computation of the variance component estimates. They also appear in the estimation of fixed and random effects coefficients, which are obtained by solving the system in Eq (11). These and other required matrix cross-products can be efficiently computed by adapting the so-called GLAM methods to the ST-PCLM setting, which we will show in the next Section.

GLAM methods for spatio-temporal penalized composite link models.

When we deal with estimating latent trends in multiple dimensions, we are susceptible to problems with storage and computational burden. In the case of data arranged in multidimensional grids, these problems can be overcome using the GLAM methods developed in [31, 36]. These methods are designed to avoid direct computation of matrix cross-products where Kronecker operations are involved, by using sequences of nested matrix operations. In this Section, we show the use of these methods in the ST-PCLM context.

First consider the matrix-by-vector products X β, Z α, and Cst γf that appear in line 3 of Algorithm 1. These expressions can be computed as: with , , , , where ρ and denote the rotated transform and the row-tensor product, respectively, defined in the S1 File. The symbol ≡ means that both sides have the same elements but in a different order. The matrices , , and , for k = 1, …, 7, are arrangements of the vectors β, γ, and αk, respectively, with . The dimensions of these matrices correspond to the number of columns of the first matrix where ρ acts, times the number of columns of the second matrix where ρ acts (i.e., has dimension ncol(R1) × ncol(x3) = q1 q2 × q3, has dimension ncol(Cs) × ncol(Ct) = m × Tf, and so on). Therefore, it holds that , , and , for k = 1, …, 7.

Next consider the matrix cross-products , , , (which is equal to ), , and that appear in line 5 in Algorithm 1 (i.e., in the system in Eq (11)). Note first that they can be reduced as: Therefore, we only need to compute Cst Γ x and Cst Γ Z. These expressions are obtained as follows:

Results

Once the ST-PCLM has been stated, we analyzed the data related to Q fever outbreaks in the Netherlands.

Weekly high-resolution smooth incidence maps

As mentioned above, we have focused on the distribution of Q fever incidence in the south of the Netherlands during 2009. Some of this data were analyzed in [41], the authors focused on the construction of a smooth incidence map at a fixed time, without considering the need to obtain predictions that were consistent with the aggregated data. More recently [42] focused on the spatial transmission of the disease.

Let’s remember that the data are aggregated at municipality and monthly levels, and that our goal is to obtain estimates of Q fever incidence at that fine grid and at each week of 2009 (i.e., disaggregate simultaneously in space and time) to obtain better insights of the evolution of the disease. The ST-PCLM approach, allows to visualize Q fever incidence at a finer spatio-temporal resolution, and also to incorporate fine-scale population information to the estimation of the latent process. Here we use two sources of information: 1) The Q fever counts that are available in the municipalities of the study area described in Fig 2b and months of 2009 (i.e., the initial spatial and temporal scales are municipalities and months, respectively), and 2) population measured at a fine spatial grid over the study area. Population information at fine spatial grids is available from the WorldPop project. The data containing the spatial distribution of population in 2009 in the Netherlands can be downloaded from https://www.worldpop.org/geodata/summary?id=42722.

Fig 3a shows a fine grid composed of 4871 regular cells of size 1000 m × 1000 m. Blue dots represent the spatial coordinates of the centroids of these cells, and Fig 3b shows the spatial distribution of the population on this fine grid, which is heterogeneous across municipalities. To set up the ST-PCLM formulation, we have used: centroids of the cells described in Fig 3a as fine-scale spatial coordinates, i.e., as x1 and x2, x3 = (1, …, 53)′ (since 53 weeks were observed in 2009), second order penalties, 12 equally-spaced knots for the marginal cubic B-spline bases B1 and B2, and 8 equally-spaced knots for the marginal cubic B-spline basis B3. We have assumed here that the population at the fine grid is constant throughout the time period; thus, ef in model (6) is considered as a vector obtained by repeating the fine-scale population fifty three times. The elements of the spatial composition matrix are obtained using Eq (5), whereas the temporal composition matrix for this case has the following form: where Sunday is considered the first day of the week. As opposed to the spatial composition matrix, the matrix Ct has some entries that are fractions. This is because some months share parts of a specific week (for example, some days of week 14 belong to March and the others to April).

thumbnail
Fig 3. The map on the left shows the fine grid of 1000 × 1000 m cells in the study area shown in Fig 2b.

The map on the right shows the spatial distribution of the population on this fine grid. Source: Estimated population density per grid-cell from the WorldPop project in 2009, the Netherlands https://www.worldpop.org/geodata/summary?id=42722. a) Fine grid (cells of size 1000x1000 m). b) Population on grid.

https://doi.org/10.1371/journal.pone.0263711.g003

Fig 4 shows the resulting ST-PCLM Q fever incidence (per 100000 inhabitants) at the target fine spatial resolution, for six selected weeks. These incidences are obtained as . The evolution of the incidence varies across municipalities and weeks, where the highest incidences are observed mostly around week 19. Most of those weeks belong to April, May, and June, which have the largest number of Q fever outbreaks observed in 2009 (see Fig 1). Note also that most of the highest incidences in week 19 are spatially concentrated around the area that includes points A and C in Fig 4, which are located in the municipalities of Landerd and Heusden, respectively (see Fig 2b). Fig 5 shows the approximate standard error maps associated with Fig 4. As expected in a Poisson setting, larger variances are found in areas with higher incidence rates.

thumbnail
Fig 4. Smoothed Q fever incidence at a detailed spatio-temporal scale, resulting from the ST-PCLM approach, for six selected weeks.

https://doi.org/10.1371/journal.pone.0263711.g004

thumbnail
Fig 5. Approximate standard error maps associated with the smoothed Q fever incidence maps in Fig 4.

https://doi.org/10.1371/journal.pone.0263711.g005

From the previous ST-PCLM estimates, we can also visualize the disaggregated (weekly) temporal evolution of the Q fever disease at specifics spatial coordinates of the fine grid. Fig 6 shows the down-scaled smoothed temporal incidence (per week) at three specific locations A, B, and C, in the study area. We observe that the temporal evolution of the incidence at point B is constant and almost zero, whereas, at points A and C, the temporally smoothed incidence present a unimodal behaviour, where the peak is reached around week 19 (May). This is consistent with the summaries of the incidences given in Figs 1 and 2.

thumbnail
Fig 6. Weekly temporal evolution of Q fever incidence in three specific points (A, B, and C), spatially presented in Fig 4 on the high-resolution map at week 19.

https://doi.org/10.1371/journal.pone.0263711.g006

Simulation studies

In this Section, we present one of two simulation studies that have been carried out to examine the predictive performance of the model under different scenarios. In both of them, we have done spatial and temporal disaggregation, but the focus of each simulation exercise is different. The first aims to check the performance of the ST-PCLM approach under different degrees of spatial dependence. In the second, we will study the behaviour of the model under different levels of temporal disaggregation (results from this second study are available in the S1 File).

Simulation study 1.

Data are generated using the fine grid in Fig 3a as the spatial region of study, and the 53 weeks in a year as the disaggregated temporal scale. The simulation is conducted as follows:

  1. The fine-resolution incidence vector is constructed based on the smoothed Q fever incidences obtained in the previous section. We denote these smoothed incidences as inc(uk), where uk, k = 1, …, K, with K = 4371 × 53 = 258163, represents the spatio-temporal coordinates at fine resolution. The different levels of spatial dependence are achieved by changing the values of the variance components that control the spatial term in the model, and (the variance component for the temporal term, , will remain fixed at the optimal value obtained in the fit). The values for the optimal variance components in the fit were: , , and . Based on these results we set 3 different scenarios:
    • Scenario 1: Variances for the spatial component are those in the fit.
    • Scenario 2: Variances for the spatial component are 100 times larger than those in the fit.
    • Scenario 3: Variance for the spatial component are 1000 times smaller than those in the fit.
  2. Calculation of the aggregated expected number of cases (over municipalities and months), μkl, k = 1, …, 72, l = 1, …, 12:
  3. 100 realizations of the number of cases in each municipality and month are generated through a random drawing from a Poisson distribution with mean parameter μ.

Fig 7 shows the incidences, incg(uk), corresponding to week 19 (which is the one with the highest number of cases) used in the three scenarios of simulation study 1. They reflect the three different levels of smoothness obtained by increasing or decreasing the variance components that control the spatial effect in the model.

thumbnail
Fig 7. Plots of incidences (for one of the weeks) used in each simulated scenario.

https://doi.org/10.1371/journal.pone.0263711.g007

For all realizations l = 1, …, 100, the predicted incidences obtained from the ST-PCLM approach for each scenario g, with g = 1, 2, 3, were compared to the smoothed incidences incg(uk). To evaluate the performance of the model we computed, for each scenario, the mean absolute error (MAE), the root mean squared error (RMSE): the correlation between the observed and predicted incidence, and the percent of grid cells with true incidence falling within 95% prediction intervals of the model. These metrics were averaged over 100 simulated data sets and they are summarized in Table 1.

thumbnail
Table 1. Performance comparison of the ST-PCLM approach in three different types of scenarios of simulation study 1.

Correlation coefficient average (avg) of the 100 replicates, average coverage % and mean absolute errors (MAE) and root mean squared error (RMSE) are also shown.

https://doi.org/10.1371/journal.pone.0263711.t001

The model performance criteria reported in Table 1 show good model performance in terms of prediction accuracy in all scenarios. All criteria showed that slightly worse predictions where obtained in scenario 2, this is mainly due to the fact that, for consistency, we have used the same size of B-spline basis for all scenarios (those used in the data analysis), but in the case of rapidly changing spatial patters (as is the case in scenario 2), a larger basis would be necessary to correctly capture the spatial effect, or adaptive P-splines [43] could be used; however, this approach is beyond of the scope of this paper.

In Fig 8 we provide further insight on the 95% coverage achieved in scenario 1 (similar plots for the other scenarios can be found in the S1 File). The lowest coverage over time (averaged over all pixels) was 85%, and was found at the extremes of the time interval (corresponding to the periods of lower incidence). The coverage for each pixel at the fine-grid resolution (averaged over time), shows that areas with lower coverage (60%–80%) correspond to locations with the highest number of cases. However, the coverage for these pixels is not constant over time, reaching almost 100% in weeks with low incidence rates. In scenario 2, the results are similar, although the coverage is lower (especially for a small number of pixels where the averaged coverage is between 20% and 30%), this is mainly due to the poor fitting for points with a sudden high incidence pick. The coverage in scenario 3 (the smoothest spatial trend) is nearly constant over the grid, ranging from 90% to 100%.

thumbnail
Fig 8. Percentage of cells in the grid with true incidence falling within 95% prediction in scenario 1.

On the left, averaged coverage per week over all cells in the grid, and on the right, averaged coverage per cell over all weeks.

https://doi.org/10.1371/journal.pone.0263711.g008

Taking into account the fact that we are predicting over more than 4000 grid cells and 53 weeks using aggregated data from 72 municipalities and 12 months, these results indicate a good predictive performance of the model.

Discussion

We have presented a novel model for the disaggregation of grouped data in both space and time, based on the spatio-temporal penalized composite link model approach. This framework was used to model Q fever counts (recorded in municipalities and months) to obtain Q fever incidence estimates over the fine grid and weeks. The model allows to obtain detailed trends in disease incidence, mortality risks, or any other vital rates at a desirable fine spatio-temporal resolution. Therefore, the resulting ST-PCLM outcomes can be displayed as a dynamic map. It also allows to include population information at a fine resolution in the estimation process. The flexibility of the model is provided by the use of B-splines, along with a penalty on the regression coefficients, and the link between the areas and the fine-resolution grid is achieved through the composition matrix. Our proposal can also address situations in which population information is recorded over small spatial units that are nested in coarser ones (for example, from municipalities to census tracts). In that case, the centroids of the small units can be used to represent the fine spatial scale.

Furthermore, the model allows the incorporation of covariates of interest (such as, for example, socio-economic, demographic, and environmental factors) in the ST-PCLM formulation to improve the estimation of the latent trend. They can be included at the aggregated level or at the fine-scale level (or at both levels simultaneously). The inclusion of covariates at the fine-scale level is immediate by adding them as columns in the design matrix X in (6) (if a linear relationship is assumed), or adding columns in Z if a non-linear relationship is expected. Details on how to include explanatory variables measured at the aggregated level can be found in [20].

It is important to acknowledge the use of GLAM methods in conjunction with the SAP algorithm, to avoid storage problems and to speed up computations. However, we are aware that the disaggregation of grouped data into a very detailed resolution could lead to increased computational load and storage problems. The sparsity of the marginal composition matrices can be exploited to deal with these issues (see, for example, [44]).

We have conducted two simulation studies to evaluate the prediction accuracy of the ST-PCLM. The first study aimed to check the performance of the model under different degrees of spatial dependence, and the second to asses the effect of different levels of temporal disaggregation. The overall performance of the model was good, although, as expected, it was affected by the number of units available at the aggregated level.

In the ST-PCLM formulation, we assume the spatial unit boundaries remain fixed over time. But it may happen that some of these boundaries change over the years. This issue is known in the statistical literature as the spatio-temporal misalignment problem. As future work, we plan to extend the ST-PCLM approach to handle this problem, where marginal composition matrices will play an important role. Furthermore, we can exploit the unique relationship between the penalties associated with basis and the covariance structure they yield to explore the use of other common spatio-temporal covariance matrices.

The implementation of our proposal and all data analyses were carried out using the statistical software R [45].

References

  1. 1. Waller LA, Gotway CA. Applied Spatial Statistics for Public Health Data. John Wiley & Sons, New York; 2004.
  2. 2. MacNab YC, Dean CB. Autoregressive spatial smoothing and temporal spline smoothing for mapping rates. Biometrics. 2001;57:949–956. pmid:11550949
  3. 3. Ugarte MD, Goicoa T, Militino AF. Spatio-temporal modeling of mortality risks using penalized splines. Environmetrics. 2010;21(3-4):270–289.
  4. 4. Waller LA, Carlin BP, Xia H, Gelfand AE. Hierarchical Spatio-Temporal Mapping of Disease Rates. Journal of the American Statistical Association. 1997;92(438):607–617.
  5. 5. Knorr-Held L. Bayesian modelling of inseparable space-time variation in disease risk. Statistics in Medicine. 2000;19(17-18):2555–2567. pmid:10960871
  6. 6. Martínez-Beneito MA, López-Quilez A, Botella-Rocamora P. An autoregressive approach to spatio-temporal disease mapping. Statistics in Medicine. 2008;27(15):2874–2889. pmid:17979141
  7. 7. Rue H, Martino S, Chopin N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2009;71(2):319–392.
  8. 8. Schrödle B, Held L. Spatio-temporal disease mapping using INLA. Environmetrics. 2011;22(6):725–734.
  9. 9. Ugarte MD, Adin A, Goicoa T, Militino AF. On fitting spatio-temporal disease mapping models using approximate Bayesian inference. Statistical Methods in Medical Research. 2014;23(6):507–530. pmid:24713158
  10. 10. Bauer C, Wakefield J, Rue H, Self S, Feng Z, Wang Y. Bayesian penalized spline models for the analysis of spatio-temporal count data. Statist Med. 2016;35(11):1848–1865. pmid:26530705
  11. 11. Gelfand AE, Diggle PJ, Fuentes M, Guttorp P. Handbook of Spatial Statistics. CRC Press, Boca Raton, Florida; 2010.
  12. 12. Eilers PCH. Ill-posed problems with counts, the composite link model and penalized likelihood. Statistical Modelling. 2007;7(3):239–254.
  13. 13. Lee D.-J., Durban M. P-spline ANOVA-type interaction models for spatio-temporal smoothing. Statistical Modelling. 2011;11(1):49–69.
  14. 14. Lee D.-J., Durbán M. Smooth-CAR mixed models for spatial count data. Computational Statistics and Data Analysis. 2009;53(8):2968–2979.
  15. 15. Rodríguez-Álvarez MX, Durbán M, Lee D.-J., Eilers PHC. Spatio-temporal adaptive penalized splines with application to Neuroscience; 2016. Available from: http://arxiv.org/abs/1610.06860.
  16. 16. Eilers PHC, Marx BD. Flexible smoothing with B-splines and penalties. Statistical Science. 1996;11(2):89–121.
  17. 17. Berke O. Exploratory disease mapping: Kriging the spatial risk function from regional count data. International Journal of Health Geographics. 2004;3(18):–. pmid:15333131
  18. 18. Goovaerts P. Geostatistical analysis of disease data: Accounting for spatial support and population density in the isopleth mapping of cancer mortality risk using area-to-point Poisson kriging. International Journal of Health Geographics. 2006;5(52). pmid:17137504
  19. 19. Diggle PJ, Moraga P, Rowlingson B, Taylor BM. Spatial and spatio-temporal log-gaussian Cox processes: Extending the geostatistical paradigm. Statistical Science. 2013;28(4):542–563.
  20. 20. Ayma D, Durbán M, Lee D.-J., Eilers PHC. Penalized composite link models for aggregated spatial count data: A mixed model approach. Spatial Statistics. 2016;17:179–198.
  21. 21. Allcroft DJ, Glasbey CA. A latent Gaussian Markov random-field model for spatiotemporal rainfall disaggregation. Journal of the Royal Statistical Society Series C: Applied Statistics. 2003;52(4):487–498.
  22. 22. Prairie J, Rajagopalan B, Lall U, Fulp T. A stochastic nonparametric technique for space-time disaggregation of streamflows. Water Resources Research. 2007;43(3).
  23. 23. Segond ML, Neokleous N, Makropoulos C, Onof C, Maksimovic C. Simulation and spatio-temporal disaggregation of multi-site rainfall data for urban drainage applications. Hydrological Sciences Journal. 2007;52(5):917–935.
  24. 24. Schleiss M, Berne A. Stochastic space-time disaggregation of rainfall into dsd fields. Journal of Hydrometeorology. 2012;13(6):1954–1969.
  25. 25. Bindhu VM, Narasimhan B. Development of a spatio-temporal disaggregation method (DisNDVI) for generating a time series of fine resolution NDVI images. ISPRS Journal of Photogrammetry and Remote Sensing. 2015;101:57–68.
  26. 26. Ballas D, Rossiter D, Thomas B, Clarke G, Dorling D. Geography Matters: Simulating the local impacts of national social policies. Joseph Rowntree Foundation; 2005. 3. Available from: http://www.dannydorling.org/wp-content/files/dannydorling_publication_id1233.pdf.
  27. 27. Brown GL. Q Fever. British Medical Journal. 1973;2(5857):43–45. pmid:4735198
  28. 28. Roest HIJ, Tilburg JJHC, Van Der Hoek W, Vellema P, Van Zijderveld FG, Klaassen CHW, et al. The Q fever epidemic in the Netherlands: History, onset, response and reflection. Epidemiology and Infection. 2011;139(1):1–12. pmid:20920383
  29. 29. van der Hoek W, Dijkstra F, Schimmer B, Schneeberger PM, Vellema P, Wijkmans C, et al. Q fever in the Netherlands: An update on the epidemiology and control measures. Eurosurveillance. 2010;15(12):pii:19520. pmid:20350500
  30. 30. Thompson R, Baker RJ. Composite Link Functions in Generalized Linear Models. Applied Statistics. 1981;30(2):125.
  31. 31. Eilers PHC, Currie ID, Durbán M. Fast and compact smoothing on large multidimensional grids. Computational Statistics & Data Analysis. 2006;50(1):61–76.
  32. 32. Breslow NE, Clayton DG. Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association. 1993;88(421):9.
  33. 33. Patterson HD, Thompson R. Recovery of inter-block information when block sizes are unequal. Biometrika. 1971;58(3):545–554.
  34. 34. Lin X, Zhang D. Inference in generalized additive mixed models by using smoothing splines. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1999;61(2):381–400.
  35. 35. Rodríguez-Álvarez MX, Lee D.-J., Kneib T, Durbán M, Eilers PHC. Fast smoothing parameter separation in multidimensional generalized P-splines: the SAP algorithm. Statistics and Computing. 2015;25(5):941–957.
  36. 36. Currie ID, Durbán M, Eilers PHC. Generalized linear array models with applications to multidimensional smoothing. J R Statist Soc B. 2006;68(2):259–280.
  37. 37. Harville DA. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association. 1977;72(358):320–338.
  38. 38. Schall R. Estimation in generalized linear models with random effects. Biometrika. 1991;78(4):719–727.
  39. 39. Harville DA. Matrix Algebra From a Statistician’s Perspective. Springer-Verlag, New York; 1997.
  40. 40. Hastie TJ, Tibshirani RJ. Generalized Additive Models. Chapman & Hall, London; 1990.
  41. 41. van der Hoek W, Van De Kassteele JVD, Bom B, DE Bruin AD, Dijkstra F, Schimmer B, et al. Smooth incidence maps give valuable insight into Q fever outbreaks in the Netherlands. Geospatial Health. 2013;7(1):127–134.
  42. 42. De Koeijer AA, Hagenaars TJ, Van Leuken JPG, Swart AN, Boender GJ. Spatial transmission risk during the 2007- 2010 Q fever epidemic in the Netherlands: Analysis of the farm-to-farm and farm-toresident transmission. PLoS ONE. 2020;15(2):e0227491.
  43. 43. Rodríguez-Álvarez MX, Durban M, Lee D.-J., Eilers PHC. On the estimation of variance parameters in non-standard generalised linear mixed models: application to penalised smoothing. Statistics and Computing. 2019;29(3):483–500.
  44. 44. Bates D, Martin M. Package ‘Matrix’ Sparse and Dense Matrix Classes and Methods Author; 2015. Available from: https://cran.r-project.org/package=Matrix.
  45. 45. Team R Development Core. A Language and Environment for Statistical Computing; 2018. Available from: http://www.r-project.org.