Modeling latent spatio-temporal disease incidence using penalized composite link models

Dae-Jin Lee; María Durbán; Diego Ayma; Jan Van de Kassteele

doi:10.1371/journal.pone.0263711

Abstract

Epidemiological data are frequently recorded at coarse spatio-temporal resolutions to protect confidential information or to summarize it in a compact manner. However, the detailed patterns followed by the source data, which may be of interest to researchers and public health officials, are overlooked. We propose to use the penalized composite link model (Eilers PCH (2007)), combined with spatio-temporal P-splines methodology (Lee D.-J., Durban M (2011)) to estimate the underlying trend within data that have been aggregated not only in space, but also in time. Model estimation is carried out within a generalized linear mixed model framework, and sophisticated algorithms are used to speed up computations that otherwise would be unfeasible. The model is then used to analyze data obtained during the largest outbreak of Q-fever in the Netherlands.

Citation: Lee D-J, Durbán M, Ayma D, Van de Kassteele J (2022) Modeling latent spatio-temporal disease incidence using penalized composite link models. PLoS ONE 17(3): e0263711. https://doi.org/10.1371/journal.pone.0263711

Editor: Jaroslaw Harezlak, Indiana University, UNITED STATES

Received: March 5, 2021; Accepted: January 25, 2022; Published: March 10, 2022

Copyright: © 2022 Lee et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The dataset contain personal identifying information and are therefore protected by the prevailing privacy legislation; currently the General Data Protection Regulation of the EU (GDPR). For that reason these datasets are not publicly available. Contact information for these datasets are as follows: Human Q-fever: The National Institute for Public Health and the Environment, info@rivm.nl. The human population density data can be obtained from Statistics Netherlands (CBS): https://www.cbs.nl/nl-nl/dossier/nederland-regionaal/geografische-data/gegevens-per-postcode.

Funding: This study was supported in the form of funding by Ministerio de Ciencia, Innovación y Universidades (Grant Nos. SEV-2017-0718 and MTM2017-82379-R) awarded to DJL, (Grant No. PID2019-10490RB-100) awarded to MD, by Ministerio de Economía, Industria y Competitividad, Gobierno de España (Grant No. MTM2014-52184-P) awarded to DJL, MD, and DA, and by Agencia Estatal de Investigación (Grant No. PID2020-115882RB-I00) awarded to DJL.

Competing interests: The authors have declared that no competing interests exist.

Introduction

In recent decades, the development of spatio-temporal statistical methods in Epidemiology, the cornerstone of Public Health, has been remarkable. Such development has been mainly driven by advances in geographic information systems (GISs), access to reliable health data registers, and the availability of powerful software capabilities to process and analyze large amounts of data. The methodological contributions to the analysis of spatio-temporal health data come from several interdisciplinary researchers, whose backgrounds are mostly related to Statistics, Geography, Environmental Sciences, and Epidemiology.

Within the diversity of of epidemiological research, disease mapping has attracted much interest in Public Health as it helps to visualize disease incidence of mortality risk patterns in a specific area. To this end, appropriate statistical methods have been applied to health data, which are usually recorded per spatial units, to provide smoothed disease incidences per unit. Smoothing is performed to obtain more stable and less noisy estimates of the incidence rates associated with each unit [1], which helps to determine meaningful spatial patterns. By adding the temporal dimension to this context, it is possible to examine the evolution of disease incidence in each unit, during a certain period of time (generally divided in years), but it implies a challenge for smoothing data, in terms of computational time and storage, even more if disaggregation is also needed in time. Several techniques have been proposed for the spatio-temporal smoothing of health data; most of which are developed under an empirical Bayes approach, where B-splines are used [2, 3] or a hierarchical Bayesian framework where conditional autoregressive (CAR) structures are included [4–6]. Regarding this last approach, methods that use integrated nested Laplace approximations (INLA, [7]) have recently been proposed (see [8–10]; among others).

All the works mentioned above provide smoothed estimates that change over time within a unit and across units within time interval, although each estimated value is assigned to the whole unit. Furthermore, most of them can be extended to include covariates or explanatory variables, which must have the same spatio-temporal resolution as health data. Therefore, they restrict the incorporation of population information at fine-scale and other relevant risk factors recorded at a finer resolution (see the modifiable areal unit problem; [11, Ch. 29]). The challenge here is not only that the map is dynamic, but the fact that, in order to obtain a better insight of the evolution of the incidence, detailed maps are needed at a finer spatial and temporal scale than the one provided (for example, grid instead of areas and weeks instead of months) while maintaining the coherence with the aggregated observed counts. To overcome the limitations of previous works, we propose the use of the spatio-temporal penalized composite link model(ST-PCLM) to estimate the latent distribution of spatio-temporally grouped count data. Our proposal is based on the penalized composite link model of [12], combined with a spatio-temporal P-spline methodology [13], to obtain smoothed estimates at a finer resolution from data aggregated over space and/or time. Moreover, the approach allows to include population information at fine-scale and specific random effects or further correlation structure, if necessary. This could be in the form of unstructured variation (including random effects per spatial unit) or structured variation by assuming a known variance-covariance matrix for the spatial random effects, for example using a CAR model as in [14].

As in any latent observation problem, there are infinitely possible solutions that fit the observed data. In that sense, the model is not identifiable, and the use of some prior knowledge or constraints (based on experience or common sense) is required for the problem to be well posed. Because we want to estimate a distribution, it is reasonable to expect a relatively smooth result, so we assume here that the underlying spatio-temporal process behind aggregated data is smooth and stationary (although this last hypothesis could also be relaxed by using multidimensional adaptive smoothing as in [15]). The flexibility of the model is given by the use of B-spline bases and a discrete penalty on the regression coefficients, following the P-spline methodology [16]. Smoothness is controlled by three smoothing parameters (two for the spatial dimension and one for the temporal dimension) that have to be estimated along with the overall trend at the fine scale. To perform the spatio-temporal disaggregation, the ST-PCLM uses a composition matrix that links both coarse and fine resolutions, and which is expressed as a Kronecker product of two marginal composition matrices (one acting at a spatial level and another at temporal level).

We can find several techniques that allow the spatial disaggregation of health data. For example, the use of an interpolation process from empirical Bayes estimates [17], (generalized) Poisson kriging methods [18], log-Gaussian Cox processes [19], and spatial composite link mixed models [20] have been suggested to produce a continuous smooth surface (across the study area) from regional health data. However, as far as we know, there is no appropriate model to address the problem of disaggregation of health data in both space and time (although there are some works about spatio-temporal disaggregation in a Gaussian context; see, for example, [21–25]). A common among social scientist is geographical microsimulation (see for example [26]). Geographical microsimulation models simulate populations in given geographical areas, with population characteristics as close as possible to their real counterparts. But a drawback may be that a representative sample of individuals at the fine-scale data required by those methods is not always available. The methodology presented here allows to create detailed dynamic maps for disease incidence data. Publicly available maps in aggregated form over space and time are the only information required, as well as geographical locations and time points where a finer resolution prediction is required.

Materials and methods

Q fever data

Q fever is a widespread zoonotic disease caused by the bacterium Coxiella burnetii. C. burnetii transmission to humans is mainly associated with ruminants such as cattle, sheep, and goats. During parturition or abortion of infected animals, high numbers of C. burnetii are shed within the amniotic fluids and the placenta. These organisms end up in the environment, where they may survive for long periods due to their resistance to heat, dryness, and many common disinfectants. Humans are often highly susceptible to the disease, and very few organisms may be necessary to cause infection. More information about this infectious disease is provided in [27].

The Southern Netherlands faced large outbreaks of human Q fever from 2007 to 2010 [28]. In this country, the local municipal health services (MHSs) are responsible for registering all confirmed diagnoses of acute Q fever. The information collected is then entered into the electronic national infectious diseases surveillance database. Due to confidentiality, these data are not publicly available and, in some instances, maybe provided in an aggregated form. In this case, the data were made available monthly at the municipality level.

Fig 1 shows the temporal distribution of Q fever cases (in months) from January 2007 to July 2010. A total of 3806 acute Q fever cases were registered in this period: 192 in 2007, 980 in 2008, 2309 in 2009, and 325 in 2010. The epidemic peaks of each year were observed every spring, specifically during May. This coincides with the birth period of small ruminants (sheep and goats), a fact that was pointed out in several studies about those exceptionally large Q fever outbreaks in the Netherlands (see, for example, [28, 29]). Since the largest outbreak was observed during 2009, we have studied the distribution of Q fever incidence in that year.

Download:

Fig 1. Human Q fever cases in the Netherlands grouped per months, from January 2007 to July 2010.

https://doi.org/10.1371/journal.pone.0263711.g001

Fig 2a shows postal code areas affected by human Q fever (red points) in 2009. However, the number of cases was only publicly available at the municipality level. Since dairy goat and sheep farms are not evenly distributed across the Netherlands, Q-fever infection did not occur in all areas of the country. In this study we have focused on a 60 × 60km area in the south of the Netherlands (see black square in Fig 2a) with a total of 72 municipalities. The total number of Q fever cases reported in these municipalities was 1798. Taking into account the number of inhabitants in each municipality, we can calculate Q fever incidence (per 100000 inhabitants). Fig 2b shows the spatial distribution of the resulting Q fever incidence (aggregated over months in 2009), with higher incidence values observed around the municipalities of Landerd (1439.676), Lith (562.546), and Heusden (295.006). Both Figs 1 and 2 show a smooth distribution of counts in space and time which makes our initial assumption of a smooth latent distribution plausible.

Download:

Fig 2. Map of human Q fever cases in the Netherlands, 2009.

Left: red points indicate the residential addresses of human cases (2309 in total). Right: study area in the south of the Netherlands showing (raw) incidence (per 100000 inhabitants) of Q fever by municipality in 2009. Source: geo-referenced data are obtained from https://www.cbs.nl/en-gb/onze-diensten/open-data/statline-as-open-data/cartography. a) Q fever outbreaks in 2009. b) Q fever incidence in study area.

https://doi.org/10.1371/journal.pone.0263711.g002

The spatio-temporal penalized composite link model

The Spatio-Temporal Penalized Composite Link Model (ST-PCLM) is the name of the model we propose for the spatio-temporal disaggregation of epidemiological data. ST-PCLM combines the penalized composite link model (PCLM) of [12] and spatio-temporal smoothing with P-splines [13]. The PCLM generalizes the model proposed by [30] and allows the estimation of a smooth trend, from grouped data, at a finer resolution. The idea behind it is as follows:

We observe data y from a Poisson distribution with mean μ, then we assume that μ is a composition of the ungrouped distribution γ. The approach assumes that the underlying (ungrouped) distribution is smooth, but otherwise let the data determine their shape.

As PCLM applies to the unidimentional scenario, we develop it in the spatio-temporal context. Then, we have included the ideas presented in [13] to reformulate the new spatio-temporal PCLM under a generalized linear mixed model (GLMM) framework, i.e., ST-PCLM. Hereafter, we will show the development of the ST-PCLM and a computationally efficient way to estimate the ST-PCLM parameters.

The model.

Let y_it, i = 1, …, n, t = 1, …, T_a, denote count data recorded over n non-overlapping spatial units v_i, which constitute the area of interest, at T_a time periods. Suppose that we want to estimate the latent distribution of the vector of counts at a spatio-temporal support that is a refinement of the original one. The fine support is determined by three coordinates: x₁ = (x₁₁, …, x_1m)′ and x₂ = (x₂₁, …, x_2m)′, with m > n, which represent the longitude and latitude coordinates of the spatial refinement, respectively; and , with T_f > T_a, which represents the coordinates of the temporal refinement. We assume that the spatial refinement remains fixed at each instant of time in x₃, but the method can easily be extended to relax this assumption. Assuming that y is Poisson distributed with mean vector μ, the ST-PCLM is given by: (1) where γ_f denotes the fine-scale latent mean, C_st is the composition matrix that describes how these latent means are combined to yield μ, and f_st(x₁, x₂, x₃) represents the fine-scale spatio-temporal trend. Several approaches can be used to model that function, e.g.: Gaussian random fields, spatio-temporal kriging or penalized splines, among others. We propose the last option since smoothness might be a reasonable property to ask for: if we are estimating an unknown distribution, it seems natural to assume smoothness since we do not know much about the underlying process (the use of the other techniques mentioned is out of the scope of this paper and will be subject of further work). We assume that the non-separable function f_st(x₁, x₂, x₃) may be represented by the product of simpler functions , , and . Therefore, the basis representation of the function would be given by: (2) where B₁ = B(x₁), B₂ = B(x₂), and B₃ = B(x₃) are univariate B-spline bases [16] of dimensions m × c₁, m × c₂ and T_f × c₃ respectively. The matrix operators ⊗ and □ on the right-hand side of Eq (2) represent Kronecker and Box products (row tensor products), respectively [31] (if the spatial refinement changes over time, Box product would be used instead of Kronecker product). To achieve smoothness, we use an anisotropic penalty (based on second order differences constructed separately for each independent variable) over the vector of regression coefficients θ. The form of this matrix is as follows: (3) where denotes an identity matrix of dimension c_d × c_d, λ_d is the smoothing parameter that controls the amount of smoothing along the coordinate x_d, and is the marginal penalty matrix based on D_d, which computes q_d-th differences, i.e., , for d = 1, 2, 3. As we mentioned earlier, the matrix in Eq (3) has an anisotropic property in the sense that allows a different amount of smoothing for each coordinate.

Since we are assuming that the spatial refinement is fixed over the temporal refinement, the composition matrix C_st in model (1) is obtained as: (4) where C_s and C_t are the spatial and temporal composition matrices of dimensions n × m and T_a × T_f, respectively. The structure of these matrices depends on the type and level of aggregation. For example, if we want to estimate the latent distribution at a fine spatial grid (over a study area), the entries of the spatial composition matrix will be: (5) where (x_1j, x_2j) are the cell centroid coordinates of the fine grid, for i = 1, …, n and j = 1, …, m. Another option to construct C_s is to consider its entries as the area proportions that each grid cell shares with a specific unit. The temporal composition matrix C_t is used to disaggregate coarse time intervals into detailed time periods (for example, from years or trimesters to months, weeks, or days). In our application, we show the structure that C_t will have, specifically for our purposes. Note that if C_s = I_n, , and the unit centroids are used as spatial coordinates, the presented methodology is reduced to the Poisson version of the proposal given by [13] for smoothing of spatio-temporal count data.

When spatial data are recorded over a coarse rectangular grid, an appropriate definition for B_st in Eq (2) is B_st = B₃ ⊗ (B₂ ⊗ B₁), where the spatial refinement correspond to the cell centroid coordinates of a fine grid. The penalty matrix P_st in Eq (3) is still valid in this context, because its definition is independent of data structure. Moreover, the spatial composition matrix would be given by C_s = C₂ ⊗ C₁, where each C_d is constructed according to the disaggregation of the coarse grid cells into small ones, for d = 1, 2. Although the Kronecker structure in (4) is a computational advantage, our model can cope with more complex situations where the spatial disaggregation changes from one time point to another (for example, census tracks changes along the years) by simply changing the structure of C_st.

Mixed model representation.

Now, we show how to incorporate fine-scale population information into the model (1). To do so, we must reformulate the model as a Generalized Linear Mixed Model (GLMM) by following the proposal of [13] as it is briefly described below.

In the P-splines literature, [13] show a nice representation of the spatio-temporal trend as a mixed model. Following their approach, the term f_st(x₁, x₂, x₃) in model (2) can be rewritten as f_st(x₁, x₂, x₃) = B_st θ = x β + Z α. Thus, we have: (6) where X and Z are fixed and random effects matrices, and β and α are their associated coefficients, respectively. Random effects have covariance matrix G that depends on the smoothing parameters. The model includes an offset term log(e_f) that allows the analysis of mortality or incidence rates instead of counts. The vector e_f could be, for example, fine-scale population information or expected number of deaths. If we only have the offset term at the coarse scale, i.e., , a naive approach to estimate e_f assumes that the elements of e are evenly distributed throughout the fine resolution. Therefore, we can compute these naive estimates as , where denotes the Moore-Penrose inverse of C_st.

The construction of the matrices X, Z, and G in model (6) are obtained by using the singular value decomposition (SVD) of each discrete penalty matrix P_d in Eq (3), for d = 1, 2, 3. The mixed model matrices are given by: (7) where X_d = B_d U_dn and Z_d = B_d U_ds are constructed from the matrices of singular vectors corresponding to null and non-null singular values of the SVD of P_d, U_dn and U_ds, respectively, for d = 1, 2, 3. Denoting as the diagonal matrix with the non-null singular values of the SVD of P_d in the main diagonal, the inverse of the covariance matrix G becomes the block-diagonal matrix: (8) where:

If data are spatially recorded over a rectangular coarse grid, the corresponding mixed model matrices are obtained as in Eq (7), but replacing the Box products □ by Kronecker products ⊗. The formulation of G⁻¹ remains the same.

Parameter estimation.

Once the ST-PCLM defined in (6) is in the GLMM framework, it is possible to estimate its parameters. This estimation procedure was presented by [20] in a spatial disaggregation context, and it involves two interrelated stages: (a) estimation of fixed coefficients and random effects (β and α); and (b) estimation of smoothing parameters (λ₁, λ₂, and λ₃). The penalized quasi-likelihood (PQL) methods of [32] are used for stage (a), and the restricted (or residual) maximum likelihood (REML, [32, 33]) is used for stage (b) as a numerical optimization criterion for smoothing parameter selection. Technical details are provided in [20] and, thus, we only describe here the necessary results.

For given values of λ₁, λ₂, and λ₃, the estimation of the fixed and random effects coefficients of the model (6) are obtained by maximizing the following approximate penalized log-likelihood: (9) where 1 denotes a vector of ones of length n ⋅ T_a and μ = C_st γ_f, with γ_f = e_f * exp(X β + Z α). Differentiation of Eq (9) with respect to β and α leads to the score equations: (10) where and are called ‘working’ mixed model matrices, since W = diag(μ) and Γ = diag(γ_f) change during the estimation procedure. Defining the working vector as: the solution of the score equations in (10) via Fisher scoring algorithm is expressed as the iterative solution of the system: (11) where b = G⁻¹ α. This yields to a modified version of the standard mixed model estimators: (12) (13) where: (15)

Conditioning on the estimates (12) and (13), the smoothing parameters λ₁, λ₂, and λ₃ can be estimated by maximizing the approximate REML (see [32, Eq. 13]): (16) where λ₁, λ₂, and λ₃ are involved in V through G. Therefore, optimal estimates for the ST-PCLM parameters are obtained by iteration between Eqs (12), (13) and (16), until convergence.

Once the ST-PCLM parameter estimates at convergence are obtained, we can derive standard errors for by using the Bayesian approximation of the variance-covariance matrix for (see [34] for further details). Thus, the approximate standard errors for are obtained by taking the square root of , which is obtained as:

We should note that the estimation procedure described above can be computationally inefficient if we are working with large datasets and the goal is to obtain estimates at a very fine scale. Furthermore, the direct creation of the matrices C_sf, X, and Z in model (7) can easily lead to storage problems. In the next Sections, we will provide solutions for the ST-PCLM estimation in terms of storage and efficiency (these solutions were not available in [20]).

Fast algorithm for spatio-temporal penalized composite link models.

In order to efficiently compute estimates of the smoothing parameters, we have adapted the SAP (Separation of Anisotropic Penalties) algorithm of [35] to the ST-PCLM context. To avoid possible storage problems, we have used the so-called GLAM (Generalized Linear Array Methods, [31, 36]) in the ST-PCLM setting. These methods also provide an efficient way to compute the matrix of cross-products required for the SAP algorithm, easing the computation time of model estimation.

Under the ST-PCLM approach and conditioning on the estimates given in Eqs (12) and (13), we can estimate λ₁, λ₂, and λ₃ by numerical maximization of the approximate REML in Eq (16). The usual algorithms used to approximate the solutions, such as the one proposed by [37] (extended to the generalized case by [38]), can only deal with situations in which the variance-covariance matrix is linear on the variance parameters. In the case of spatio-temporal models with anisotropic penalties, regression parameters are affected by more than one smoothing parameter, and so standard approaches can’t be used, since the corresponding variance-covariance matrix does not have the required form. [35] proposed the SAP algorithm which provides a numerical solution to REML estimates, but it is able to deal with models that have a precision matrix for the random effect vector that is linear in the inverse of the variance parameters. These precision matrices are common when penalized smooth models with anisotropic penalties are reformulated as (generalized) linear mixed models, where the smoothing parameters are seen as ratios of variance components, i.e., , for d = 1, 2, 3. Since we are working under a Poisson framework, the dispersion parameter, ϕ, is equal to 1. Thus, the problem is reduced to obtain estimates for the variance components , , and . The reformulation of the SAP algorithm for the ST-PCLM is described below.

Following [35] we can derive closed-form expressions, from the approximate REML, for the variance components , for d = 1, 2, 3. These estimates are given by: (17) where: (18) with N defined in Eq (15), and The non-null submatrices of each Λ_d were previously defined in Section. Here the inverse of the covariance matrix G in model (6) is written in terms of ’s and can be decomposed as , where the capital lambdas are defined above. The SAP algorithm for the ST-PCLM parameter estimation is given in Algorithm 1, which is an adaptation of the algorithm provided in [35, p. 945].

Algorithm 1 SAP algorithm for the ST-PCLM parameters estimation

Require: Convergence tolerances ν₁ and ν₂ (e.g., 1 × 10⁻⁶) and maximum number of iterations maxit₁ and maxit₂ (e.g., 100).

1: Set initial values for the mixed model coefficients β and α, and the variance components , , and (for example, with length q₁ q₂ q₃, with length (c₁ c₂ c₃−q₁ q₂ q₃), and ). Set k = 0.

2: for 1 to maxit₁ do

3: Given the current estimates for the mixed model coefficients, construct the matrix of weights W and the working vector z as follows: with .

4: for 1 to maxit₂ do

5: Given the current estimates for the variance components, obtain new estimates for β and α by solving the system in Eq (11). The resulting estimates are denoted as and , respectively.

6: Obtain new estimates for the variance components using Eq (17). The resulting estimates are denoted as , for d = 1, 2, 3.

7: Compare new variance component estimates with the previous ones, using the following convergence criterion:

8: If the convergence tolerance is achieved, break, otherwise set and repeat steps 5, 6, and 7 until convergence.

9: end for

10: Compute a new estimate for the fine-scale smooth trend vector using the fixed and random effects estimates obtained in the last iteration of step 5. The resulting vector is denoted as . Compare the new estimate with the previous one, using the following convergence criterion:

11: If the convergence tolerance is achieved, break, otherwise set k = k + 1 and repeat steps 2 to 11 until convergence.

12: end for

Although we have suggested setting the maximum number of iterations to 100, our experience is that the number of iterations needed to achieve convergence is much smaller. In general, we have observed that the maximum number of iterations to obtain optimal variance components is greater than the number of iterations to obtain optimal estimates for the linear predictor. In the analysis of Q-fever data, convergence was reached after 30 iterations.

We should note that we can efficiently compute the trace in Eq (18) by taking into account that G Λ_d G is a diagonal matrix. Thus, we only have to compute the diagonal of to obtain this trace. From [39, Eq. (5.3)], we have that: Therefore, the diagonal elements of matrix are obtained by the column-wise addition of:

An advantage of using the adapted SAP algorithm provided above is that we can directly compute the effective dimension (ED) of model (6). This model complexity measure is given by: (19) where each ed_d is computed from Eq (18). The first term on the right-hand side of Eq (19) corresponds to the dimension of the unpenalized (or fixed) part, whereas the second corresponds to the unpenalized (or random) part of the fitted model. To visualize the later statement, note that: where is the ‘hat matrix’ [40] of the unpenalized part of the fitted ST-PCLM (see Eq (13)).

Note that the matrix cross-products , , and (and its transpose) are involved in the computation of the variance component estimates. They also appear in the estimation of fixed and random effects coefficients, which are obtained by solving the system in Eq (11). These and other required matrix cross-products can be efficiently computed by adapting the so-called GLAM methods to the ST-PCLM setting, which we will show in the next Section.

GLAM methods for spatio-temporal penalized composite link models.

When we deal with estimating latent trends in multiple dimensions, we are susceptible to problems with storage and computational burden. In the case of data arranged in multidimensional grids, these problems can be overcome using the GLAM methods developed in [31, 36]. These methods are designed to avoid direct computation of matrix cross-products where Kronecker operations are involved, by using sequences of nested matrix operations. In this Section, we show the use of these methods in the ST-PCLM context.

First consider the matrix-by-vector products X β, Z α, and C_st γ_f that appear in line 3 of Algorithm 1. These expressions can be computed as: with , , , , where ρ and denote the rotated transform and the row-tensor product, respectively, defined in the S1 File. The symbol ≡ means that both sides have the same elements but in a different order. The matrices , , and , for k = 1, …, 7, are arrangements of the vectors β, γ, and α_k, respectively, with . The dimensions of these matrices correspond to the number of columns of the first matrix where ρ acts, times the number of columns of the second matrix where ρ acts (i.e., has dimension ncol(R₁) × ncol(x₃) = q₁ q₂ × q₃, has dimension ncol(C_s) × ncol(C_t) = m × T_f, and so on). Therefore, it holds that , , and , for k = 1, …, 7.

Next consider the matrix cross-products , , , (which is equal to ), , and that appear in line 5 in Algorithm 1 (i.e., in the system in Eq (11)). Note first that they can be reduced as: Therefore, we only need to compute C_st Γ x and C_st Γ Z. These expressions are obtained as follows:

Results

Once the ST-PCLM has been stated, we analyzed the data related to Q fever outbreaks in the Netherlands.

Weekly high-resolution smooth incidence maps

As mentioned above, we have focused on the distribution of Q fever incidence in the south of the Netherlands during 2009. Some of this data were analyzed in [41], the authors focused on the construction of a smooth incidence map at a fixed time, without considering the need to obtain predictions that were consistent with the aggregated data. More recently [42] focused on the spatial transmission of the disease.

Let’s remember that the data are aggregated at municipality and monthly levels, and that our goal is to obtain estimates of Q fever incidence at that fine grid and at each week of 2009 (i.e., disaggregate simultaneously in space and time) to obtain better insights of the evolution of the disease. The ST-PCLM approach, allows to visualize Q fever incidence at a finer spatio-temporal resolution, and also to incorporate fine-scale population information to the estimation of the latent process. Here we use two sources of information: 1) The Q fever counts that are available in the municipalities of the study area described in Fig 2b and months of 2009 (i.e., the initial spatial and temporal scales are municipalities and months, respectively), and 2) population measured at a fine spatial grid over the study area. Population information at fine spatial grids is available from the WorldPop project. The data containing the spatial distribution of population in 2009 in the Netherlands can be downloaded from https://www.worldpop.org/geodata/summary?id=42722.

Fig 3a shows a fine grid composed of 4871 regular cells of size 1000 m × 1000 m. Blue dots represent the spatial coordinates of the centroids of these cells, and Fig 3b shows the spatial distribution of the population on this fine grid, which is heterogeneous across municipalities. To set up the ST-PCLM formulation, we have used: centroids of the cells described in Fig 3a as fine-scale spatial coordinates, i.e., as x₁ and x₂, x₃ = (1, …, 53)′ (since 53 weeks were observed in 2009), second order penalties, 12 equally-spaced knots for the marginal cubic B-spline bases B₁ and B₂, and 8 equally-spaced knots for the marginal cubic B-spline basis B₃. We have assumed here that the population at the fine grid is constant throughout the time period; thus, e_f in model (6) is considered as a vector obtained by repeating the fine-scale population fifty three times. The elements of the spatial composition matrix are obtained using Eq (5), whereas the temporal composition matrix for this case has the following form: where Sunday is considered the first day of the week. As opposed to the spatial composition matrix, the matrix C_t has some entries that are fractions. This is because some months share parts of a specific week (for example, some days of week 14 belong to March and the others to April).

Download:

Fig 3. The map on the left shows the fine grid of 1000 × 1000 m cells in the study area shown in Fig 2b.

The map on the right shows the spatial distribution of the population on this fine grid. Source: Estimated population density per grid-cell from the WorldPop project in 2009, the Netherlands https://www.worldpop.org/geodata/summary?id=42722. a) Fine grid (cells of size 1000x1000 m). b) Population on grid.

https://doi.org/10.1371/journal.pone.0263711.g003

Fig 4 shows the resulting ST-PCLM Q fever incidence (per 100000 inhabitants) at the target fine spatial resolution, for six selected weeks. These incidences are obtained as . The evolution of the incidence varies across municipalities and weeks, where the highest incidences are observed mostly around week 19. Most of those weeks belong to April, May, and June, which have the largest number of Q fever outbreaks observed in 2009 (see Fig 1). Note also that most of the highest incidences in week 19 are spatially concentrated around the area that includes points A and C in Fig 4, which are located in the municipalities of Landerd and Heusden, respectively (see Fig 2b). Fig 5 shows the approximate standard error maps associated with Fig 4. As expected in a Poisson setting, larger variances are found in areas with higher incidence rates.

Download:

Fig 4. Smoothed Q fever incidence at a detailed spatio-temporal scale, resulting from the ST-PCLM approach, for six selected weeks.

https://doi.org/10.1371/journal.pone.0263711.g004

Download:

Fig 5. Approximate standard error maps associated with the smoothed Q fever incidence maps in Fig 4.

https://doi.org/10.1371/journal.pone.0263711.g005

From the previous ST-PCLM estimates, we can also visualize the disaggregated (weekly) temporal evolution of the Q fever disease at specifics spatial coordinates of the fine grid. Fig 6 shows the down-scaled smoothed temporal incidence (per week) at three specific locations A, B, and C, in the study area. We observe that the temporal evolution of the incidence at point B is constant and almost zero, whereas, at points A and C, the temporally smoothed incidence present a unimodal behaviour, where the peak is reached around week 19 (May). This is consistent with the summaries of the incidences given in Figs 1 and 2.

Download:

Fig 6. Weekly temporal evolution of Q fever incidence in three specific points (A, B, and C), spatially presented in Fig 4 on the high-resolution map at week 19.

https://doi.org/10.1371/journal.pone.0263711.g006

Simulation studies

In this Section, we present one of two simulation studies that have been carried out to examine the predictive performance of the model under different scenarios. In both of them, we have done spatial and temporal disaggregation, but the focus of each simulation exercise is different. The first aims to check the performance of the ST-PCLM approach under different degrees of spatial dependence. In the second, we will study the behaviour of the model under different levels of temporal disaggregation (results from this second study are available in the S1 File).

Simulation study 1.

Data are generated using the fine grid in Fig 3a as the spatial region of study, and the 53 weeks in a year as the disaggregated temporal scale. The simulation is conducted as follows:

The fine-resolution incidence vector is constructed based on the smoothed Q fever incidences obtained in the previous section. We denote these smoothed incidences as inc(u_k), where u_k, k = 1, …, K, with K = 4371 × 53 = 258163, represents the spatio-temporal coordinates at fine resolution. The different levels of spatial dependence are achieved by changing the values of the variance components that control the spatial term in the model, and (the variance component for the temporal term, , will remain fixed at the optimal value obtained in the fit). The values for the optimal variance components in the fit were: , , and . Based on these results we set 3 different scenarios:
- Scenario 1: Variances for the spatial component are those in the fit.
- Scenario 2: Variances for the spatial component are 100 times larger than those in the fit.
- Scenario 3: Variance for the spatial component are 1000 times smaller than those in the fit.
Calculation of the aggregated expected number of cases (over municipalities and months), μ_kl, k = 1, …, 72, l = 1, …, 12:
100 realizations of the number of cases in each municipality and month are generated through a random drawing from a Poisson distribution with mean parameter μ.

Fig 7 shows the incidences, inc_g(u_k), corresponding to week 19 (which is the one with the highest number of cases) used in the three scenarios of simulation study 1. They reflect the three different levels of smoothness obtained by increasing or decreasing the variance components that control the spatial effect in the model.

Download:

Fig 7. Plots of incidences (for one of the weeks) used in each simulated scenario.

https://doi.org/10.1371/journal.pone.0263711.g007

For all realizations l = 1, …, 100, the predicted incidences obtained from the ST-PCLM approach for each scenario g, with g = 1, 2, 3, were compared to the smoothed incidences inc_g(u_k). To evaluate the performance of the model we computed, for each scenario, the mean absolute error (MAE), the root mean squared error (RMSE): the correlation between the observed and predicted incidence, and the percent of grid cells with true incidence falling within 95% prediction intervals of the model. These metrics were averaged over 100 simulated data sets and they are summarized in Table 1.

Download:

Table 1. Performance comparison of the ST-PCLM approach in three different types of scenarios of simulation study 1.

Correlation coefficient average (avg) of the 100 replicates, average coverage % and mean absolute errors (MAE) and root mean squared error (RMSE) are also shown.

https://doi.org/10.1371/journal.pone.0263711.t001

The model performance criteria reported in Table 1 show good model performance in terms of prediction accuracy in all scenarios. All criteria showed that slightly worse predictions where obtained in scenario 2, this is mainly due to the fact that, for consistency, we have used the same size of B-spline basis for all scenarios (those used in the data analysis), but in the case of rapidly changing spatial patters (as is the case in scenario 2), a larger basis would be necessary to correctly capture the spatial effect, or adaptive P-splines [43] could be used; however, this approach is beyond of the scope of this paper.

In Fig 8 we provide further insight on the 95% coverage achieved in scenario 1 (similar plots for the other scenarios can be found in the S1 File). The lowest coverage over time (averaged over all pixels) was 85%, and was found at the extremes of the time interval (corresponding to the periods of lower incidence). The coverage for each pixel at the fine-grid resolution (averaged over time), shows that areas with lower coverage (60%–80%) correspond to locations with the highest number of cases. However, the coverage for these pixels is not constant over time, reaching almost 100% in weeks with low incidence rates. In scenario 2, the results are similar, although the coverage is lower (especially for a small number of pixels where the averaged coverage is between 20% and 30%), this is mainly due to the poor fitting for points with a sudden high incidence pick. The coverage in scenario 3 (the smoothest spatial trend) is nearly constant over the grid, ranging from 90% to 100%.

Download:

Fig 8. Percentage of cells in the grid with true incidence falling within 95% prediction in scenario 1.

On the left, averaged coverage per week over all cells in the grid, and on the right, averaged coverage per cell over all weeks.

https://doi.org/10.1371/journal.pone.0263711.g008

Taking into account the fact that we are predicting over more than 4000 grid cells and 53 weeks using aggregated data from 72 municipalities and 12 months, these results indicate a good predictive performance of the model.

Discussion

We have presented a novel model for the disaggregation of grouped data in both space and time, based on the spatio-temporal penalized composite link model approach. This framework was used to model Q fever counts (recorded in municipalities and months) to obtain Q fever incidence estimates over the fine grid and weeks. The model allows to obtain detailed trends in disease incidence, mortality risks, or any other vital rates at a desirable fine spatio-temporal resolution. Therefore, the resulting ST-PCLM outcomes can be displayed as a dynamic map. It also allows to include population information at a fine resolution in the estimation process. The flexibility of the model is provided by the use of B-splines, along with a penalty on the regression coefficients, and the link between the areas and the fine-resolution grid is achieved through the composition matrix. Our proposal can also address situations in which population information is recorded over small spatial units that are nested in coarser ones (for example, from municipalities to census tracts). In that case, the centroids of the small units can be used to represent the fine spatial scale.

Furthermore, the model allows the incorporation of covariates of interest (such as, for example, socio-economic, demographic, and environmental factors) in the ST-PCLM formulation to improve the estimation of the latent trend. They can be included at the aggregated level or at the fine-scale level (or at both levels simultaneously). The inclusion of covariates at the fine-scale level is immediate by adding them as columns in the design matrix X in (6) (if a linear relationship is assumed), or adding columns in Z if a non-linear relationship is expected. Details on how to include explanatory variables measured at the aggregated level can be found in [20].

It is important to acknowledge the use of GLAM methods in conjunction with the SAP algorithm, to avoid storage problems and to speed up computations. However, we are aware that the disaggregation of grouped data into a very detailed resolution could lead to increased computational load and storage problems. The sparsity of the marginal composition matrices can be exploited to deal with these issues (see, for example, [44]).

We have conducted two simulation studies to evaluate the prediction accuracy of the ST-PCLM. The first study aimed to check the performance of the model under different degrees of spatial dependence, and the second to asses the effect of different levels of temporal disaggregation. The overall performance of the model was good, although, as expected, it was affected by the number of units available at the aggregated level.

In the ST-PCLM formulation, we assume the spatial unit boundaries remain fixed over time. But it may happen that some of these boundaries change over the years. This issue is known in the statistical literature as the spatio-temporal misalignment problem. As future work, we plan to extend the ST-PCLM approach to handle this problem, where marginal composition matrices will play an important role. Furthermore, we can exploit the unique relationship between the penalties associated with basis and the covariance structure they yield to explore the use of other common spatio-temporal covariance matrices.

The implementation of our proposal and all data analyses were carried out using the statistical software R [45].

Supporting information

S1 File.

https://doi.org/10.1371/journal.pone.0263711.s001

(PDF)

References

1. Waller LA, Gotway CA. Applied Spatial Statistics for Public Health Data. John Wiley & Sons, New York; 2004.
2. MacNab YC, Dean CB. Autoregressive spatial smoothing and temporal spline smoothing for mapping rates. Biometrics. 2001;57:949–956. pmid:11550949
- View Article
- PubMed/NCBI
- Google Scholar
3. Ugarte MD, Goicoa T, Militino AF. Spatio-temporal modeling of mortality risks using penalized splines. Environmetrics. 2010;21(3-4):270–289.
- View Article
- Google Scholar
4. Waller LA, Carlin BP, Xia H, Gelfand AE. Hierarchical Spatio-Temporal Mapping of Disease Rates. Journal of the American Statistical Association. 1997;92(438):607–617.
- View Article
- Google Scholar
5. Knorr-Held L. Bayesian modelling of inseparable space-time variation in disease risk. Statistics in Medicine. 2000;19(17-18):2555–2567. pmid:10960871
- View Article
- PubMed/NCBI
- Google Scholar
6. Martínez-Beneito MA, López-Quilez A, Botella-Rocamora P. An autoregressive approach to spatio-temporal disease mapping. Statistics in Medicine. 2008;27(15):2874–2889. pmid:17979141
- View Article
- PubMed/NCBI
- Google Scholar
7. Rue H, Martino S, Chopin N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2009;71(2):319–392.
- View Article
- Google Scholar
8. Schrödle B, Held L. Spatio-temporal disease mapping using INLA. Environmetrics. 2011;22(6):725–734.
- View Article
- Google Scholar
9. Ugarte MD, Adin A, Goicoa T, Militino AF. On fitting spatio-temporal disease mapping models using approximate Bayesian inference. Statistical Methods in Medical Research. 2014;23(6):507–530. pmid:24713158
- View Article
- PubMed/NCBI
- Google Scholar
10. Bauer C, Wakefield J, Rue H, Self S, Feng Z, Wang Y. Bayesian penalized spline models for the analysis of spatio-temporal count data. Statist Med. 2016;35(11):1848–1865. pmid:26530705
- View Article
- PubMed/NCBI
- Google Scholar
11. Gelfand AE, Diggle PJ, Fuentes M, Guttorp P. Handbook of Spatial Statistics. CRC Press, Boca Raton, Florida; 2010.
12. Eilers PCH. Ill-posed problems with counts, the composite link model and penalized likelihood. Statistical Modelling. 2007;7(3):239–254.
- View Article
- Google Scholar
13. Lee D.-J., Durban M. P-spline ANOVA-type interaction models for spatio-temporal smoothing. Statistical Modelling. 2011;11(1):49–69.
- View Article
- Google Scholar
14. Lee D.-J., Durbán M. Smooth-CAR mixed models for spatial count data. Computational Statistics and Data Analysis. 2009;53(8):2968–2979.
- View Article
- Google Scholar
15. Rodríguez-Álvarez MX, Durbán M, Lee D.-J., Eilers PHC. Spatio-temporal adaptive penalized splines with application to Neuroscience; 2016. Available from: http://arxiv.org/abs/1610.06860.
16. Eilers PHC, Marx BD. Flexible smoothing with B-splines and penalties. Statistical Science. 1996;11(2):89–121.
- View Article
- Google Scholar
17. Berke O. Exploratory disease mapping: Kriging the spatial risk function from regional count data. International Journal of Health Geographics. 2004;3(18):–. pmid:15333131
- View Article
- PubMed/NCBI
- Google Scholar
18. Goovaerts P. Geostatistical analysis of disease data: Accounting for spatial support and population density in the isopleth mapping of cancer mortality risk using area-to-point Poisson kriging. International Journal of Health Geographics. 2006;5(52). pmid:17137504
- View Article
- PubMed/NCBI
- Google Scholar
19. Diggle PJ, Moraga P, Rowlingson B, Taylor BM. Spatial and spatio-temporal log-gaussian Cox processes: Extending the geostatistical paradigm. Statistical Science. 2013;28(4):542–563.
- View Article
- Google Scholar
20. Ayma D, Durbán M, Lee D.-J., Eilers PHC. Penalized composite link models for aggregated spatial count data: A mixed model approach. Spatial Statistics. 2016;17:179–198.
- View Article
- Google Scholar
21. Allcroft DJ, Glasbey CA. A latent Gaussian Markov random-field model for spatiotemporal rainfall disaggregation. Journal of the Royal Statistical Society Series C: Applied Statistics. 2003;52(4):487–498.
- View Article
- Google Scholar
22. Prairie J, Rajagopalan B, Lall U, Fulp T. A stochastic nonparametric technique for space-time disaggregation of streamflows. Water Resources Research. 2007;43(3).
- View Article
- Google Scholar
23. Segond ML, Neokleous N, Makropoulos C, Onof C, Maksimovic C. Simulation and spatio-temporal disaggregation of multi-site rainfall data for urban drainage applications. Hydrological Sciences Journal. 2007;52(5):917–935.
- View Article
- Google Scholar
24. Schleiss M, Berne A. Stochastic space-time disaggregation of rainfall into dsd fields. Journal of Hydrometeorology. 2012;13(6):1954–1969.
- View Article
- Google Scholar
25. Bindhu VM, Narasimhan B. Development of a spatio-temporal disaggregation method (DisNDVI) for generating a time series of fine resolution NDVI images. ISPRS Journal of Photogrammetry and Remote Sensing. 2015;101:57–68.
- View Article
- Google Scholar
26. Ballas D, Rossiter D, Thomas B, Clarke G, Dorling D. Geography Matters: Simulating the local impacts of national social policies. Joseph Rowntree Foundation; 2005. 3. Available from: http://www.dannydorling.org/wp-content/files/dannydorling_publication_id1233.pdf.
27. Brown GL. Q Fever. British Medical Journal. 1973;2(5857):43–45. pmid:4735198
- View Article
- PubMed/NCBI
- Google Scholar
28. Roest HIJ, Tilburg JJHC, Van Der Hoek W, Vellema P, Van Zijderveld FG, Klaassen CHW, et al. The Q fever epidemic in the Netherlands: History, onset, response and reflection. Epidemiology and Infection. 2011;139(1):1–12. pmid:20920383
- View Article
- PubMed/NCBI
- Google Scholar
29. van der Hoek W, Dijkstra F, Schimmer B, Schneeberger PM, Vellema P, Wijkmans C, et al. Q fever in the Netherlands: An update on the epidemiology and control measures. Eurosurveillance. 2010;15(12):pii:19520. pmid:20350500
- View Article
- PubMed/NCBI
- Google Scholar
30. Thompson R, Baker RJ. Composite Link Functions in Generalized Linear Models. Applied Statistics. 1981;30(2):125.
- View Article
- Google Scholar
31. Eilers PHC, Currie ID, Durbán M. Fast and compact smoothing on large multidimensional grids. Computational Statistics & Data Analysis. 2006;50(1):61–76.
- View Article
- Google Scholar
32. Breslow NE, Clayton DG. Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association. 1993;88(421):9.
- View Article
- Google Scholar
33. Patterson HD, Thompson R. Recovery of inter-block information when block sizes are unequal. Biometrika. 1971;58(3):545–554.
- View Article
- Google Scholar
34. Lin X, Zhang D. Inference in generalized additive mixed models by using smoothing splines. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1999;61(2):381–400.
- View Article
- Google Scholar
35. Rodríguez-Álvarez MX, Lee D.-J., Kneib T, Durbán M, Eilers PHC. Fast smoothing parameter separation in multidimensional generalized P-splines: the SAP algorithm. Statistics and Computing. 2015;25(5):941–957.
- View Article
- Google Scholar
36. Currie ID, Durbán M, Eilers PHC. Generalized linear array models with applications to multidimensional smoothing. J R Statist Soc B. 2006;68(2):259–280.
- View Article
- Google Scholar
37. Harville DA. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association. 1977;72(358):320–338.
- View Article
- Google Scholar
38. Schall R. Estimation in generalized linear models with random effects. Biometrika. 1991;78(4):719–727.
- View Article
- Google Scholar
39. Harville DA. Matrix Algebra From a Statistician’s Perspective. Springer-Verlag, New York; 1997.
40. Hastie TJ, Tibshirani RJ. Generalized Additive Models. Chapman & Hall, London; 1990.
41. van der Hoek W, Van De Kassteele JVD, Bom B, DE Bruin AD, Dijkstra F, Schimmer B, et al. Smooth incidence maps give valuable insight into Q fever outbreaks in the Netherlands. Geospatial Health. 2013;7(1):127–134.
- View Article
- Google Scholar
42. De Koeijer AA, Hagenaars TJ, Van Leuken JPG, Swart AN, Boender GJ. Spatial transmission risk during the 2007- 2010 Q fever epidemic in the Netherlands: Analysis of the farm-to-farm and farm-toresident transmission. PLoS ONE. 2020;15(2):e0227491.
- View Article
- Google Scholar
43. Rodríguez-Álvarez MX, Durban M, Lee D.-J., Eilers PHC. On the estimation of variance parameters in non-standard generalised linear mixed models: application to penalised smoothing. Statistics and Computing. 2019;29(3):483–500.
- View Article
- Google Scholar
44. Bates D, Martin M. Package ‘Matrix’ Sparse and Dense Matrix Classes and Methods Author; 2015. Available from: https://cran.r-project.org/package=Matrix.
- View Article
- Google Scholar
45. Team R Development Core. A Language and Environment for Statistical Computing; 2018. Available from: http://www.r-project.org.

[ref1] 1. Waller LA, Gotway CA. Applied Spatial Statistics for Public Health Data. John Wiley & Sons, New York; 2004.

[ref2] 2. MacNab YC, Dean CB. Autoregressive spatial smoothing and temporal spline smoothing for mapping rates. Biometrics. 2001;57:949–956. pmid:11550949
View Article
PubMed/NCBI
Google Scholar

[3] View Article

[4] PubMed/NCBI

[5] Google Scholar

[ref3] 3. Ugarte MD, Goicoa T, Militino AF. Spatio-temporal modeling of mortality risks using penalized splines. Environmetrics. 2010;21(3-4):270–289.
View Article
Google Scholar

[7] View Article

[8] Google Scholar

[ref4] 4. Waller LA, Carlin BP, Xia H, Gelfand AE. Hierarchical Spatio-Temporal Mapping of Disease Rates. Journal of the American Statistical Association. 1997;92(438):607–617.
View Article
Google Scholar

[10] View Article

[11] Google Scholar

[ref5] 5. Knorr-Held L. Bayesian modelling of inseparable space-time variation in disease risk. Statistics in Medicine. 2000;19(17-18):2555–2567. pmid:10960871
View Article
PubMed/NCBI
Google Scholar

[13] View Article

[14] PubMed/NCBI

[15] Google Scholar

[ref6] 6. Martínez-Beneito MA, López-Quilez A, Botella-Rocamora P. An autoregressive approach to spatio-temporal disease mapping. Statistics in Medicine. 2008;27(15):2874–2889. pmid:17979141
View Article
PubMed/NCBI
Google Scholar

[17] View Article

[18] PubMed/NCBI

[19] Google Scholar

[ref7] 7. Rue H, Martino S, Chopin N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2009;71(2):319–392.
View Article
Google Scholar

[21] View Article

[22] Google Scholar

[ref8] 8. Schrödle B, Held L. Spatio-temporal disease mapping using INLA. Environmetrics. 2011;22(6):725–734.
View Article
Google Scholar

[24] View Article

[25] Google Scholar

[ref9] 9. Ugarte MD, Adin A, Goicoa T, Militino AF. On fitting spatio-temporal disease mapping models using approximate Bayesian inference. Statistical Methods in Medical Research. 2014;23(6):507–530. pmid:24713158
View Article
PubMed/NCBI
Google Scholar

[27] View Article

[28] PubMed/NCBI

[29] Google Scholar

[ref10] 10. Bauer C, Wakefield J, Rue H, Self S, Feng Z, Wang Y. Bayesian penalized spline models for the analysis of spatio-temporal count data. Statist Med. 2016;35(11):1848–1865. pmid:26530705
View Article
PubMed/NCBI
Google Scholar

[31] View Article

[32] PubMed/NCBI

[33] Google Scholar

[ref11] 11. Gelfand AE, Diggle PJ, Fuentes M, Guttorp P. Handbook of Spatial Statistics. CRC Press, Boca Raton, Florida; 2010.

[ref12] 12. Eilers PCH. Ill-posed problems with counts, the composite link model and penalized likelihood. Statistical Modelling. 2007;7(3):239–254.
View Article
Google Scholar

[36] View Article

[37] Google Scholar

[ref13] 13. Lee D.-J., Durban M. P-spline ANOVA-type interaction models for spatio-temporal smoothing. Statistical Modelling. 2011;11(1):49–69.
View Article
Google Scholar

[39] View Article

[40] Google Scholar

[ref14] 14. Lee D.-J., Durbán M. Smooth-CAR mixed models for spatial count data. Computational Statistics and Data Analysis. 2009;53(8):2968–2979.
View Article
Google Scholar

[42] View Article

[43] Google Scholar

[ref15] 15. Rodríguez-Álvarez MX, Durbán M, Lee D.-J., Eilers PHC. Spatio-temporal adaptive penalized splines with application to Neuroscience; 2016. Available from: http://arxiv.org/abs/1610.06860.

[ref16] 16. Eilers PHC, Marx BD. Flexible smoothing with B-splines and penalties. Statistical Science. 1996;11(2):89–121.
View Article
Google Scholar

[46] View Article

[47] Google Scholar

[ref17] 17. Berke O. Exploratory disease mapping: Kriging the spatial risk function from regional count data. International Journal of Health Geographics. 2004;3(18):–. pmid:15333131
View Article
PubMed/NCBI
Google Scholar

[49] View Article

[50] PubMed/NCBI

[51] Google Scholar

[ref18] 18. Goovaerts P. Geostatistical analysis of disease data: Accounting for spatial support and population density in the isopleth mapping of cancer mortality risk using area-to-point Poisson kriging. International Journal of Health Geographics. 2006;5(52). pmid:17137504
View Article
PubMed/NCBI
Google Scholar

[53] View Article

[54] PubMed/NCBI

[55] Google Scholar

[ref19] 19. Diggle PJ, Moraga P, Rowlingson B, Taylor BM. Spatial and spatio-temporal log-gaussian Cox processes: Extending the geostatistical paradigm. Statistical Science. 2013;28(4):542–563.
View Article
Google Scholar

[57] View Article

[58] Google Scholar

[ref20] 20. Ayma D, Durbán M, Lee D.-J., Eilers PHC. Penalized composite link models for aggregated spatial count data: A mixed model approach. Spatial Statistics. 2016;17:179–198.
View Article
Google Scholar

[60] View Article

[61] Google Scholar

[ref21] 21. Allcroft DJ, Glasbey CA. A latent Gaussian Markov random-field model for spatiotemporal rainfall disaggregation. Journal of the Royal Statistical Society Series C: Applied Statistics. 2003;52(4):487–498.
View Article
Google Scholar

[63] View Article

[64] Google Scholar

[ref22] 22. Prairie J, Rajagopalan B, Lall U, Fulp T. A stochastic nonparametric technique for space-time disaggregation of streamflows. Water Resources Research. 2007;43(3).
View Article
Google Scholar

[66] View Article

[67] Google Scholar

[ref23] 23. Segond ML, Neokleous N, Makropoulos C, Onof C, Maksimovic C. Simulation and spatio-temporal disaggregation of multi-site rainfall data for urban drainage applications. Hydrological Sciences Journal. 2007;52(5):917–935.
View Article
Google Scholar

[69] View Article

[70] Google Scholar

[ref24] 24. Schleiss M, Berne A. Stochastic space-time disaggregation of rainfall into dsd fields. Journal of Hydrometeorology. 2012;13(6):1954–1969.
View Article
Google Scholar

[72] View Article

[73] Google Scholar

[ref25] 25. Bindhu VM, Narasimhan B. Development of a spatio-temporal disaggregation method (DisNDVI) for generating a time series of fine resolution NDVI images. ISPRS Journal of Photogrammetry and Remote Sensing. 2015;101:57–68.
View Article
Google Scholar

[75] View Article

[76] Google Scholar

[ref26] 26. Ballas D, Rossiter D, Thomas B, Clarke G, Dorling D. Geography Matters: Simulating the local impacts of national social policies. Joseph Rowntree Foundation; 2005. 3. Available from: http://www.dannydorling.org/wp-content/files/dannydorling_publication_id1233.pdf.

[ref27] 27. Brown GL. Q Fever. British Medical Journal. 1973;2(5857):43–45. pmid:4735198
View Article
PubMed/NCBI
Google Scholar

[79] View Article

[80] PubMed/NCBI

[81] Google Scholar

[ref28] 28. Roest HIJ, Tilburg JJHC, Van Der Hoek W, Vellema P, Van Zijderveld FG, Klaassen CHW, et al. The Q fever epidemic in the Netherlands: History, onset, response and reflection. Epidemiology and Infection. 2011;139(1):1–12. pmid:20920383
View Article
PubMed/NCBI
Google Scholar

[83] View Article

[84] PubMed/NCBI

[85] Google Scholar

[ref29] 29. van der Hoek W, Dijkstra F, Schimmer B, Schneeberger PM, Vellema P, Wijkmans C, et al. Q fever in the Netherlands: An update on the epidemiology and control measures. Eurosurveillance. 2010;15(12):pii:19520. pmid:20350500
View Article
PubMed/NCBI
Google Scholar

[87] View Article

[88] PubMed/NCBI

[89] Google Scholar

[ref30] 30. Thompson R, Baker RJ. Composite Link Functions in Generalized Linear Models. Applied Statistics. 1981;30(2):125.
View Article
Google Scholar

[91] View Article

[92] Google Scholar

[ref31] 31. Eilers PHC, Currie ID, Durbán M. Fast and compact smoothing on large multidimensional grids. Computational Statistics & Data Analysis. 2006;50(1):61–76.
View Article
Google Scholar

[94] View Article

[95] Google Scholar

[ref32] 32. Breslow NE, Clayton DG. Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association. 1993;88(421):9.
View Article
Google Scholar

[97] View Article

[98] Google Scholar

[ref33] 33. Patterson HD, Thompson R. Recovery of inter-block information when block sizes are unequal. Biometrika. 1971;58(3):545–554.
View Article
Google Scholar

[100] View Article

[101] Google Scholar

[ref34] 34. Lin X, Zhang D. Inference in generalized additive mixed models by using smoothing splines. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1999;61(2):381–400.
View Article
Google Scholar

[103] View Article

[104] Google Scholar

[ref35] 35. Rodríguez-Álvarez MX, Lee D.-J., Kneib T, Durbán M, Eilers PHC. Fast smoothing parameter separation in multidimensional generalized P-splines: the SAP algorithm. Statistics and Computing. 2015;25(5):941–957.
View Article
Google Scholar

[106] View Article

[107] Google Scholar

[ref36] 36. Currie ID, Durbán M, Eilers PHC. Generalized linear array models with applications to multidimensional smoothing. J R Statist Soc B. 2006;68(2):259–280.
View Article
Google Scholar

[109] View Article

[110] Google Scholar

[ref37] 37. Harville DA. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association. 1977;72(358):320–338.
View Article
Google Scholar

[112] View Article

[113] Google Scholar

[ref38] 38. Schall R. Estimation in generalized linear models with random effects. Biometrika. 1991;78(4):719–727.
View Article
Google Scholar

[115] View Article

[116] Google Scholar

[ref39] 39. Harville DA. Matrix Algebra From a Statistician’s Perspective. Springer-Verlag, New York; 1997.

[ref40] 40. Hastie TJ, Tibshirani RJ. Generalized Additive Models. Chapman & Hall, London; 1990.

[ref41] 41. van der Hoek W, Van De Kassteele JVD, Bom B, DE Bruin AD, Dijkstra F, Schimmer B, et al. Smooth incidence maps give valuable insight into Q fever outbreaks in the Netherlands. Geospatial Health. 2013;7(1):127–134.
View Article
Google Scholar

[120] View Article

[121] Google Scholar

[ref42] 42. De Koeijer AA, Hagenaars TJ, Van Leuken JPG, Swart AN, Boender GJ. Spatial transmission risk during the 2007- 2010 Q fever epidemic in the Netherlands: Analysis of the farm-to-farm and farm-toresident transmission. PLoS ONE. 2020;15(2):e0227491.
View Article
Google Scholar

[123] View Article

[124] Google Scholar

[ref43] 43. Rodríguez-Álvarez MX, Durban M, Lee D.-J., Eilers PHC. On the estimation of variance parameters in non-standard generalised linear mixed models: application to penalised smoothing. Statistics and Computing. 2019;29(3):483–500.
View Article
Google Scholar

[126] View Article

[127] Google Scholar

[ref44] 44. Bates D, Martin M. Package ‘Matrix’ Sparse and Dense Matrix Classes and Methods Author; 2015. Available from: https://cran.r-project.org/package=Matrix.
View Article
Google Scholar

[129] View Article

[130] Google Scholar

[ref45] 45. Team R Development Core. A Language and Environment for Statistical Computing; 2018. Available from: http://www.r-project.org.

Figures

Abstract

Introduction

Materials and methods

Q fever data

The spatio-temporal penalized composite link model

The model.

Mixed model representation.

Parameter estimation.

Fast algorithm for spatio-temporal penalized composite link models.

GLAM methods for spatio-temporal penalized composite link models.

Results

Weekly high-resolution smooth incidence maps

Simulation studies

Simulation study 1.

Discussion

Supporting information

S1 File.

References