Introduction

Plant and animal species—and as a consequence, their pathogens and pests—are translocated across the globe at an ever-increasing rate since the nineteenth century1. Most introductions of microorganisms and insects (notably plant pathogens and/or their arthropod vectors) are accidental, while a few are deliberate for biological control purposes2. Introductions of plant pathogens can have a dramatic impact on agricultural production and food security3,4 and being able to predict the fate of a newly introduced disease based on surveillance data is pivotal to anticipate control actions5.

Where and when alien organisms are successfully introduced are central questions for the study of biological invasions, to elucidate biotic and abiotic conditions favorable to the introduction and establishment and spread of invasive species6,7. Indeed, unraveling such conditions is a prerequisite to map the risk of invasion and to design an efficient surveillance strategy7 as initial phases are critical for the establishment success of introduced pathogens8,9. It is also during this phase that invaders are the easiest to control6,10. In addition, a precise knowledge of the dates and places of introduction is critical to accurately determine invader reproduction and dispersal parameters, as their estimated values are directly dependent on the time and distance between the actual introduction and the observations11,12. However, biological invasions by alien organisms are often reported several years after the initial successful introduction event13. Thus, monitoring data generally do not provide a clear picture of the date and place of introduction.

Many successful emergences of plant viruses have taken place worldwide in the last decades14,15, whereas some have only been observed punctually with no long term establishment of the pathogens16. As plant viruses are rapidly-evolving pathogens with small genomes and high mutation rates17, they may present measurably evolving populations18 over the time scale of decades. Thus, the spatio-temporal histories of invading species can sometimes be reconstructed from georeferenced and dated genomic data, with phylogeographic methodologies19,20,21 or population dynamic models embedding evolutionary processes. Over shorter time-scales, or when only abundance data are available, recent mechanistic modelling approaches have been proposed to infer the date and place of introduction of a single species along with other demographic parameters22. In many modeling approaches, the interactions of the new viruses or new strains with preexisting virus populations are not taken into consideration, even if it is known that synergism or competition between virus species or strains can affect their maintenance and spread23,24,25.

Mechanistic models are increasingly used in statistical ecology because, compared to purely correlative approaches, their parameters can directly inform on biological processes and life history traits. Among them, the reaction–diffusion framework is widely used in spatial ecology to model dynamic species distributions26,27. In such models, the diffusion coefficient is related to dispersal ability and the growth or competition coefficients may help in understanding the respective interactions between different species, variants or genotypes. Additionally, reaction–diffusion models can easily account for spatio-temporal heterogeneities28,29. A drawback of this type of approach is that the compartments that are modelled typically correspond to continuous population densities, which rarely match with observation data. There is therefore a challenge in connecting the solution of the model with complex data, such as noisy data, binary data, temporally and spatially censored data. Recent approaches have been proposed to bridge the gap between reaction–diffusion models and data (e.g.,30, in a framework known as mechanistic-statistical modelling31,32,33. An advantage of this method is that it allows to estimate simultaneously both the biological parameters and the date and place of introduction in a single framework22. Recently, this type of approach was applied to localize and date the invasion of South Corsica by Xylella fastidiosa, based on a single-species reaction–diffusion model and binary data11.

Our objective here is to propose a modelling framework to deal with multiple introductions by several invasive variants, in competition with a resident population, when observations provide knowledge on the relative proportions of each variant at some dates and places rather than absolute abundances of the different variants. We develop a reaction–diffusion mechanistic-statistical model applied to a genetic spatio-temporal dataset reporting the relative proportions of five genetic variants of watermelon mosaic virus (WMV, genus Potyvirus, family Potyviridae) in infections of commercial cucurbit fields. Our framework allows us to (1) estimate the dates and places of successful introduction of each emerging variant along with other ecological parameters, (2) reconstruct the invasion history of the emerging variants from their introduction sites, (3) detect competitive advantages of the emerging variants as compared to the resident population, and (4) predict the fate of the different genetic groups, in particular the takeover of the emerging variants over the resident population.

Material and methods

Data

Pathosystem

WMV is widespread in cucurbit crops, mostly in temperate and Mediterranean climatic regions of the world16. WMV has a wide host range including some legumes, orchids and many weeds that can be alternative hosts16. Like other potyviruses, it is non-persistently transmitted by at least 30 aphid species16. In temperate regions, WMV causes summer epidemics on cucurbit crops, and it can overwinter in several common non-cucurbit weeds when no crops are present16,34. WMV has been common in France for more than 40 years, causing mosaics on leaves and fruits in melon, but mostly mild symptoms on zucchini squash. Since 2000, new symptoms were observed in southeastern France on zucchini squash: leaf deformations and mosaics, as well as fruit discoloration and deformations that made them unmarketable. This new agronomic problem was correlated to the introduction of new molecular groups of WMV strains. At least four new groups have emerged since 2000 and they have rapidly replaced the native “classical” strains, causing important problems for the producers35. These new groups, hereafter “emerging strains” (ES) are significantly more related molecularly to worldwide strains than to any other isolates from the French populations36. As emphasised in35, this supports that the new group of emerging strains has arisen through introductions, mostly from Southeastern Asia, rather than through local differentiation.

In this study, we focus on the pathosystem corresponding to a classical strain (CS) and four emerging strains (ESk, \(k = 1, \ldots ,4\)) of WMV and their cucurbit hosts.

Study area and sampling

The study area, located in Southeastern France, is included in a rectangle of about 25,000 km2 and is bounded on the South by the Mediterranean Sea. Between 2004 and 2008, the presence of WMV had been monitored in collaboration with farmers, farm advisers and seed companies. Each year, cultivated host plants were collected in different fields and at different dates between May 1st and September 30th. In total, more than two thousand plant samples were collected over the entire study area. All plant samples were analyzed in the INRAE Plant Pathology Unit to confirm the presence of WMV and determine the molecular type of the virus strain causing the infection (see35 for detail on field and laboratory protocols). All infected host plants were cucurbits, mostly melon and different squashes (e.g., zucchini, pumpkins).

Observations

In the absence of individual geographic coordinates, all infected host plants were attributed to the centroid of the municipality (French administrative unit, median size about 10 km2) where they have been collected. Then for one date, one observation corresponded to a municipality in which at least one infected host plant was sampled. Table 1 summarizes for each year, the number of observations (i.e. number of municipalities), the number of infected plants sampled and the proportion of each WMV strain (CS, and ES1 to ES4) found in the infected host plants. Errors in assignment of virus samples to the CS or ES strains was negligible because of the large genetic distance separating them: 5 to 10% nucleotide divergence both in the fragment used in the study and in complete genomes35, also precluding the possibility of local jumps between groups by accumulation of mutations.

Table 1 Number of observations and corresponding proportions of classical and emerging strains.

Landscape

To approximate the density of WMV host plants over the study area, we used 2006 land use data (i.e. BD Ocsol 2006 PACA and LR) produced by the CRIGE PACA (http://www.crige-paca.org/) and the Association SIG-LR (http://www.siglr.org/lassociation/la-structure.html). Based on satellite images, land use is determined at a spatial resolution of 1/50,000 using an improved three-level hierarchical typology derived from the European Corine Land Cover nomenclature. Here we used the third hierarchical level of the BD Ocsol typology (i.e. 42 land use classes) to classify the entire study area in three habitats: (1) WMV-susceptible crops, (2) habitats unfavorable to WMV host plants (e.g. forests, industrial and commercial units…) and, (3) non-terrestrial habitat (i.e. water). The proportion of WMV-susceptible crops was then computed within all cells of a raster covering the entire study area, with a spatial resolution of \(1.4 \times 1.4\) km2. These proportions were used to approximate host plant density \(z\left( {\varvec{x}} \right)\), which was normalized, so that \(z\left( {\varvec{x}} \right) = 0\) corresponds to an absence of host plants and \(z\left( {\varvec{x}} \right) = 1\) to the maximum density of host plants (Fig. 1).

Figure 1
figure 1

Approximated density \(z\left( x \right)\) of the host plants in the study area. The density was normalized, so that \(z\left( x \right) = z\left( {x_{1} ,x_{2} } \right) = 0\) corresponds to an absence of cucurbit plants and \(z\left( x \right) = 1\) to the maximum density. The axes \(x_{1}\) and \(x_{2}\) correspond to Lambert93 coordinates (in km). The white regions are non-terrestrial habitats (water). Land use data were not available in the gray regions; the host plant density was then computed by interpolation.

Mechanistic-statistical model

The general modeling strategy is based on a mechanistic-statistical approach12,22,33. This type of approach combines a mechanistic model describing the dynamics under investigation with a probabilistic model conditional on the dynamics, describing how the measurements have been collected. This method that has already proved its theoretical effectiveness in determining dispersal parameters using simulated genetic data12 aims at bridging the gap between the data and the model for the determination of virus dynamics.

Here, the mechanistic part of the model describes the spatio-temporal dynamics of the virus strains, given the model parameters (demographic parameters, introduction dates/sites). This allows us to compute the expected proportions of the five types of virus strains (CS and ES1 to ES4) at each date and site of observation. The probabilistic part of the mechanistic-statistical model describes the conditional distribution of the observed proportions of the virus strains, given the expected proportions. Using this approach, it is then possible to derive a numerically tractable formula for the likelihood function associated with the model parameters.

Population dynamics

The model is segmented into two stages: (1) the intra-annual stage describes the dispersal and growth of the five virus strains during the summer epidemics on cucurbit crops, and the competition between them, during a period ranging from May 1st (noted \(t = 0\)) to September 30 (noted \(t = t_{f}\), \(t_{f} = 153\) days); (2) the inter-annual stage describes the winter decay of the different strains when no crops are present and the virus overwinters in weeds. We denote by \(c^{n} \left( {t,{\varvec{x}}} \right)\) and \(e_{k}^{n} \left( {t,{\varvec{x}}} \right)\) the densities of classical strain (CS) and emerging strains (ESk, \(k = 1, \ldots ,4\)), at position \({\varvec{x}}\) and at time \(t\) of year \(n.\)

Dynamics of the classical strain before the first introduction events

Before the introduction of the first emerging strain, the intra-annual dynamics of the population CS is described by a standard diffusion model with logistic growth: \(\partial_{t} c^{n} = D{\Delta }c^{n} + rc^{n} \left( {z\left( {\varvec{x}} \right) - c^{n} } \right)\). Here, \({\Delta }\) is the Laplace 2D diffusion operator (sum of the second derivatives with respect to coordinate). This operator describes uncorrelated random walk movements of the viruses, with the coefficient \(D\) measuring the mobility of the viruses (e.g.,26,37). The term \(r z\left( {\varvec{x}} \right)\) is the intrinsic growth rate (i.e., growth rate in the absence of competition) of the population, which depends on the density of host plants \(z\left( {\varvec{x}} \right)\) and on a coefficient \(r\) (intrinsic growth rate at maximum host density). Under these assumptions, the carrying capacity at a position \({\varvec{x}}\) is equal to \(z\left( {\varvec{x}} \right)\), which means that the population densities are expressed in units of the maximum host population density. The model is initialized by setting \(c^{1980} \left( {0,{\varvec{x}}} \right) = (1 - m_{c} ) z\left( {\varvec{x}} \right)\), where \(m_{c}\) is the winter decay rate of the CS (see the description of the inter-annual stage below). In other terms, we assume that the CS density is at the carrying capacity in 1979, i.e., 5 years after its first detection and 20 years before the first detections of ESs38.

Introduction events

The ESs are introduced during years noted \(n_{k} \ge 1981\), at the beginning of the intra-annual stage (other dates of introduction within the intra-annual stage would lead—at most—to a one-year lag in the dynamics). Their densities are \(0\) before introduction: \(e_{k}^{n} = 0\) for \(n < n_{k}\). Once introduced, the initial density of any ES is assumed to be 1/10th of the carrying capacity at the introduction point (other values have been tested without much effect, see Supplementary Fig. S1), with a decreasing density as the distance from this point increases:

$$e_{k}^{{n_{k} }} \left( {0,x} \right) = \frac{{z\left( {\varvec{x}} \right)}}{10}\exp \left( { - \frac{\|{{\varvec{x}} - {\varvec{X}}_{{\varvec{k}}}\|^{2} }}{{2\sigma^{2} }}} \right),$$

where \({\varvec{X}}_{{\varvec{k}}}\) is the location of introduction of the strain \(k.\) In our computations, we took \(\sigma = 5\) km for the standard deviation.

Intra-annual dynamics after the first introduction event

Intra-annual dynamics were described by a neutral competition model with diffusion (properties of this model have been analyzed in [54]):

$$\left\{ {\begin{array}{*{20}c} {\partial_{t} c^{n} \left( {t,x} \right) = D\Delta c^{n} + rc^{n} \left( {z\left( {\varvec{x}} \right) - c^{n} - \mathop \sum \limits_{i = 1}^{4} e_{i}^{n} \left( {t,{\varvec{x}}} \right)} \right)} \\ {\partial_{t} e_{k}^{n} \left( {t,x} \right) = D\Delta e_{k}^{n} + re_{k}^{n} \left( {z\left( {\varvec{x}} \right) - c^{n} - \mathop \sum \limits_{i = 1}^{4} e_{i}^{n} \left( {t,{\varvec{x}}} \right)} \right)} \\ \end{array} } \right.,$$

for \(t = 0 \ldots t_{f}\) and for all introduced emerging strains, i.e. all \(k\) such that \(n \ge n_{k} .\) We assume reflecting boundary conditions, meaning that the population flows vanish at the boundary of the study area, due to truly reflecting boundaries (e.g., sea coast in the southern part of the site) or symmetric inward and outward fluxes26. In addition, in order to limit the number of unknown parameters, and in the absence of precise knowledge on the differences between the strains, we assume here that the diffusion, competition and growth coefficients are common to all the strains during the intra-annual stage (see the discussion for more details on this assumption).

Inter-annual dynamics

The population densities at time \(t = 0\) of year \(n\) are connected with those of year \(n - 1,\) at time \(t = t_{f} ,\) through the following formulas:

$$\left\{ {\begin{array}{*{20}c} {c^{n} \left( {0,{\varvec{x}}} \right) = \left( {1 - m_{c} } \right)c^{n - 1} \left( {t_{f} ,{\varvec{x}}} \right) \hbox{ for } n \ge 1981} \\ {e_{k}^{n} \left( {0,{\varvec{x}}} \right) = \left( {1 - m_{e} } \right)e_{k}^{n - 1} \left( {t_{f} ,{\varvec{x}}} \right) \hbox{ for }n \ge n_{k} + 1} \\ \end{array} } \right.$$

with \(m_{c}\) and \(m_{e}\) the winter decay rates of the CS and ESs strains (note that \(m_{e}\) is common to all of the ESs). Estimation of CS and ES decay rates provides an assessment of the competitive advantage of one type of strain vs the other.

Numerical computations

The intra-annual dynamics were solved using Comsol Multiphysics time-dependent solver, which is based on a finite element method (FEM). The triangular mesh which was used for our computations is available as Supplementary Fig. S2.

Probabilistic model for the observation process

During the years \(n = 2004, \ldots ,2008\), \(I_{n}\) observations were made (see Section Observations above and Table 1). They consist in counting data, that we denote by \(C_{i}\) and \(E_{k,i}\) for \(k = 1, \ldots ,4\) and \(i = 1, \ldots ,I_{n}\), corresponding to the number of samples infected by the CS and ESs strains, respectively, at each date of observation and location \(\left( {t_{i} ,{\varvec{x}}_{i} } \right)\). Note that these dates and locations depend on the year of observation \(n\). More generally, the above quantities should be noted \(C_{i}^{n} , E_{k,i}^{n} , t_{i}^{n} , {\varvec{x}}_{i}^{n} ;\) for simplicity, the index \(n\) is omitted in the sequel, unless necessary.

We denote by \(V_{i} = C_{i} + \mathop \sum \nolimits_{k = 1}^{4} E_{k,i}\) the total number of infected samples observed at \(\left( {t_{i} ,{\varvec{x}}_{i} } \right)\). The conditional distribution of the vector \(\left( {C_{i} ,E_{1,i} ,E_{2,i} ,E_{3,i} ,E_{4,i} } \right)\), given \(V_{i}\) can be described by a multinomial distribution \({\mathcal{M}}\left( {V_{i} ,{\varvec{p}}_{i} } \right)\) with \({\varvec{p}}_{i} = \left( {p_{i}^{c} ,p_{i}^{{e_{1} }} ,p_{i}^{{e_{2} }} ,p_{i}^{{e_{3} }} ,p_{i}^{{e_{4} }} } \right)\) the vector of the respective proportions of each strain in the population at \(\left( {t_{i} ,{\varvec{x}}_{i} } \right)\). We chose to work conditionally to \(V_{i}\) because the sample sizes are not related to the density of WMV.

Statistical inference

Unknown parameters

We denote by \({{\varvec{\Theta}}}\) the vector of unknown parameters: the diffusion coefficient \(D,\) the intrinsic growth rate at maximum host density \(r\), the winter decay rates \((m_{c} , m_{e} )\) and the locations \((x_{k} \in {\mathbb{R}}^{2}\)) and years \((n_{k}\)) of introduction, for \(k = 1, \ldots ,4.\) Thus \({{\varvec{\Theta}}} \in {\mathbb{R}}^{16} .\)

Computation of a likelihood

Given the set of parameters \({{\varvec{\Theta}}}\), the densities \(c^{n} \left( {t,{\varvec{x}}|{{\varvec{\Theta}}}} \right)\) and \(e_{k}^{n} \left( {t,{\varvec{x}}|{{\varvec{\Theta}}}} \right)\) can be computed explicitly with the mechanistic model for population dynamics. Thus, at a given year \(n\), at \(\left( {t_{i} ,x_{i} } \right)\), the parameter \({\varvec{p}}_{i}\) of the multinomial distribution \({\mathcal{M}}\left( {V_{i} ,{\varvec{p}}_{i} } \right)\) writes:

$$p_{i}^{c} \left( {{\varvec{\Theta}}} \right) = \frac{{c^{n} \left( {t_{i} ,{\varvec{x}}_{i} |{{\varvec{\Theta}}}} \right)}}{{c^{n} \left( {t_{i} ,{\varvec{x}}_{i} |{{\varvec{\Theta}}}} \right) + \mathop \sum \nolimits_{i = 1}^{4} e_{i}^{n} \left( {t_{i} ,{\varvec{x}}_{i} {|}{{\varvec{\Theta}}}} \right)}}, p_{i}^{{e_{k} }} \left( {{\varvec{\Theta}}} \right) = \frac{{e_{k}^{n} \left( {t_{i} ,{\varvec{x}}_{i} |{{\varvec{\Theta}}}} \right)}}{{c^{n} \left( {t_{i} ,{\varvec{x}}_{i} |{{\varvec{\Theta}}}} \right) + \mathop \sum \nolimits_{i = 1}^{4} e_{i}^{n} (t_{i} ,{\varvec{x}}_{i} |{{\varvec{\Theta}}})}}.$$

The probability \(P(C_{i} ,E_{1,i} ,E_{2,i} ,E_{3,i} ,E_{4,i} |{{\varvec{\Theta}}},{\text{V}}_{{\text{i}}} )\) of the observed outcome \(C_{i} ,E_{1,i} ,E_{2,i} ,E_{3,i} ,E_{4,i}\) is then

$$P\left( {C_{i} ,E_{1,i} ,E_{2,i} ,E_{3,i} ,E_{4,i} {|}{{\varvec{\Theta}}},{\text{V}}_{{\text{i}}} } \right) = \frac{{\left( {V_{i} } \right)!}}{{C_{i} ! \mathop \prod \nolimits_{k = 1}^{4} E_{k,i} !}}\left( {p_{i}^{c} \left( {{\varvec{\Theta}}} \right)} \right)^{{C_{i} }} \mathop \prod \limits_{k = 1}^{4} (p_{i}^{{e_{k} }} \left( {{\varvec{\Theta}}} \right))^{{E_{k,i} }} .$$

Assuming that the observations during each year and at each date/location are independent from each other conditionally on the virus strain proportions, we get the following formula for the likelihood:

$${\mathcal{L}}\left( {{\varvec{\Theta}}} \right) = \mathop \prod \limits_{n = 2004}^{2008} \mathop \prod \limits_{{i = 1, \ldots , I_{n} }} P\left( {C_{i} ,E_{1,i} ,E_{2,i} ,E_{3,i} ,E_{4,i} {|}{{\varvec{\Theta}}},{\text{V}}_{{\text{i}}} } \right).$$

A priori constraints on the parameters

By definition and for biological reasons, the parameter vector \({{\varvec{\Theta}}}\) satisfies some constraints. First, \(D \in \left( {10^{ - 4} ,10} \right){\text{ km}}^{2} /{\text{day}}\), \(r \in \left( {0.1,1} \right) {\text{day}}^{ - 1} ,\) and \(m_{c} , m_{e} \in \left\{ {0,0.1,0.2, \ldots ,0.9} \right\},\) (see Supplementary Note S7 for a biological interpretation of these values). Second, we assumed that the locations of introductions \({\varvec{X}}_{{\varvec{k}}}\) belong to the study area. To facilitate the estimation procedure, the points \({\varvec{X}}_{{\varvec{k}}}\) were searched in a regular grid with 20 points (see Supplementary Fig. S3), and the dates of introduction \(n_{k}\) were searched in \(\left\{ {1985,1990,1995,2000} \right\}.\)

Inference procedure

Due to the important computation time (4 min in average for one simulation of the model on an Intel(R) Core(R) CPU i7-4790 @ 3.60 GHz), we were not able to compute an a posteriori distribution of the parameters in a Bayesian framework. Rather, we used a simulated annealing algorithm to compute the maximum likelihood estimate (MLE), i.e., the parameter \({{\varvec{\Theta}}}^{*}\) which leads to the highest log-likelihood. This is an iterative algorithm, which constructs a sequence \(({{\varvec{\Theta}}}_{j} )_{j \ge 1}\) converging in probability towards a MLE. It is based on an acceptance-rejection procedure, where the acceptance rate depends on the current iteration \(j\) through a "cooling rate" (\(\alpha )\). Empirically, a good trade-off between quality of optimization and time required for computation (number of iterations) is obtained with exponential cooling rates of the type \(T_{0} \alpha^{j}\) with \(0 < \alpha < 1\) and some constant \(T_{0} \gg 1\) (this cooling schedule was first proposed in= 39 = 39). Too rapid cooling (\(\alpha \ll 1\)) results in a system frozen into a state far from the optimal one, whereas too slow cooling (\(\alpha \approx 1\)) leads to important computation times due to very slow convergence. Here, we ran \(6\) parallel sequences with cooling rates \(\alpha \in \left\{ {0.995,0.999,0.9995} \right\}\). For this type of algorithm, there are no general rules for the choice of the stopping criterion [HenJac03], which should be heuristically adapted to the considered optimization problem. Here, our stopping criterion was that \({{\varvec{\Theta}}}_{j}\) remained unchanged during 500 iterations. The computations took about 100 days (CPU time).

Confidence intervals and goodness-of-fit

To assess the model’s goodness-of-fit, 95% confidence regions were computed for the observations \(\left( {C_{i} ,E_{1,i} ,E_{2,i} ,E_{3,i} ,E_{4,i} } \right)\) at each date/location \(\left( {t_{i} ,{\varvec{x}}_{i} } \right),\) and for each year of observation. The confidence regions were computed by assessing the probability of each possible outcome of the observation process, at each date/location, based on the computed proportions \({\varvec{p}}_{i} = \left( {p_{i}^{c} ,p_{i}^{{e_{1} }} ,p_{i}^{{e_{2} }} ,p_{i}^{{e_{3} }} ,p_{i}^{{e_{4} }} } \right)\), corresponding to the output of the mechanistic model using the MLE \({{\varvec{\Theta}}}^{\user2{*}}\) and given the total number of infected samples \(V_{i}\). Then, we checked if the observations belonged to the 95% most probable outcomes.

Results

Convergence and goodness-of-fit

As expected, the highest likelihood was obtained with a slow cooling rate (\(\alpha = 0.9995).\) The corresponding MLE, denoted by \({{\varvec{\Theta}}}^{*} = \left( {D^{*} ,r^{*} ,m_{c}^{*} ,m_{e}^{*} ,{\varvec{X}}_{1}^{\user2{*}} ,{\varvec{X}}_{2}^{\user2{*}} ,{\varvec{X}}_{3}^{\user2{*}} ,{\varvec{X}}_{4}^{\user2{*}} ,n_{1}^{*} ,n_{2}^{*} ,n_{3}^{*} ,n_{4}^{*} } \right)\) is presented in Table 2. In average, 96% of the observations fell within the 95% confidence regions, indicating that the model fits the data well (277 observations over a total of 289; 94% in 2004, 98% in 2005, 96% in 2006, 96% in 2007 and 95% in 2008). We also note that the likelihood function is peaked at \({{\varvec{\Theta}}}^{*}\) in the sense that any perturbation in one component of \({{\varvec{\Theta}}}^{*}\) leads to lower likelihood (see Supplementary Note S8 for likelihood-ratio based confidence intervals and Supplementary Fig. S4 for more details on the profile of the likelihood function). This suggests that the MLE \({{\varvec{\Theta}}}^{\user2{*}}\) is close to the actual maximizer of the likelihood function.

Table 2 Maximum likelihood estimates.

Parameter values

As shown in Table 2, the MLE corresponds to the same date of introduction for ES1, ES2, ES3 whereas ES4 has been introduced five years later. Note that a same date of introduction does not mean a same date of detection: depending on local conditions, some strains may establish and spread faster than others (as observed below for ES4).

Regarding the sites of introduction, the MLE indicates that ES1 and ES2 have been introduced at the northeastern corner of the study area, ES3 in the northwest and ES4 in the southwest (see the white crosses in the 2004 panel of Fig. 2; see also Supplementary Fig. S3 for more details on the corresponding likelihood). Three introduction points (ESs 1, 2 and 3) were estimated at the edge of the study site, indicating that the introductions may have occurred outside of the study area.

Figure 2
figure 2

Proportions of the classical and emerging strains in the landscape: data and simulations. The colors of the shaded regions indicate which strain is the most prevalent. The red regions correspond to the CS strain; light blue and blue: ES1, ES2 (these two strains have the same density, only ES1 is represented); green: ES3; pink: ES4. The pie charts describe the relative proportions of the strains found in the data (same color legend). The white crosses on the 2004 panel represent the estimated sites of introduction. The simulation results presented here correspond to the middle of the intra-annual stage (2nd week of June), and were obtained with the MLE \(\varvec{\Theta}^{*}\).

Regarding the biological parameters, the winter decay rate of the CS strain was much higher than that of the ESs strains (0.5 vs. 0), reflecting a high competitive advantage of the ESs. The value \(D^{*} = 0.44\,{\text{km}}^{2}\) per day of the diffusion coefficient indicates that a virus travels about \(1.2\) km per day in average during the growing season of cucurbit crops (see Supplementary Note S7). The growth rate of \(0.31\,{\text{day}}^{ - 1}\) corresponds to an increase by a factor \(e^{0.31} \approx 1.3\) each day, in the absence of competition.

Strain distributions

Figure 2 depicts the most prevalent strains at each position in the landscape, during four of the five years of observation (2008 is presented in Supplementary Fig. S5), obtained by solving our model with the MLE \({{\varvec{\Theta}}}^{*}\), together with the data. We graphically note a good agreement between the positions of the observed strains and the distributions obtained with the model: ES1 and ES2 tend to be distributed in the eastern part of the study area, ES3 in the northwestern part and ES4 in the southwestern part, while the CS strain tends to be progressively confined to the central part of the study area. In 2007, ES3 seems to be more prevalent according to the model than suggested by the observations, probably due to its introduction site, which is far from the observation sites (see the first panel in Fig. 2). Note that, with the deterministic framework used here, as ES1 and ES2 share the same date and position of introduction, and the same parameter values, their distributions are completely equal; thus only ES1 is represented in the figures.

The spatial distributions of the different strains at each year where one of the emerging strains becomes locally more prevalent are depicted in Fig. 3. Although ESs 1, 2 and 3 have been introduced at the same date, their dynamics are influenced by local conditions: ESs 1 and 2 become the most prevalent at least in one part of the study area in 1996 (6 years after their introduction), ES3 in 2001 (11 years after introduction) and ES4 in 2002 (7 years after its introduction). Thus, despite the neutrality assumption, the heterogeneity of the landscape leads to different durations of the establishment stage. The full timeline of the dynamics of the different strain proportions in the landscape, from the first estimated introduction date of an emerging strain (1990) to 2019 is available as Supplementary Fig. S6. Since 2008, due to their competitive advantage (modelled here as a lower winter decay rate), the ESs replaced the CS, which is not anymore the most prevalent strain, whatever the position in the study area, 18 years after the first introduction. Before saturation, the spread rates of the ESs are about 5 km/year (estimated as the slope of \(\sqrt {invaded\, area/\pi }\), the invaded area corresponding to virus densities > carrying capacity/100). Then, the distribution of the ESs remains almost at equilibrium until the last year of simulation, which is a consequence of the neutrality assumption (equal fitness of all the ESs) (Fig. 3; last panel).

Figure 3
figure 3

Simulated proportions of the classical and emerging strains in the landscape: before and after the observation window. The simulation results presented here correspond to the middle of the intra-annual stage (2nd week of June), and were obtained with the MLE \({\Theta }^{*}\). The colors of the shaded regions indicate which strain is the most prevalent. The red regions correspond to the classical strain; blue: ES1, ES2 (these two strains have the same density, only ES1 is represented); green: ES3; pink: ES4.

Average proportions in the study area and effect of the CS on the ESs

To get a quantitative insight into the replacement of the CS by the ESs, we computed the relative global proportion of each strain by integrating the simulation results (with the parameters corresponding to the MLE \({{\varvec{\Theta}}}^{\user2{*}}\)) over the study area (Fig. 4, panel (A)). Before the first introduction in 1990, the classical strain represents 100% of the infections. In 2010, it represents only 10% of the infections. This decline, which was already visible in the 2004–2008 data34,35, is well-captured by the model, though with a slight advance. These discrepancies between the predicted proportions and the data are probably due to the positions of the observation sites, which are concentrated at the center of the domain, where the CS is more prevalent (see Fig. 2). In order to understand the effect of the presence of a resident CS strain on the emerging ESs strains, we compared the dynamics presented in Fig. 4, panel (A) with a hypothetical scenario describing the dynamics of the ESs in the absence of CS. For this, we used the MLE \({{\varvec{\Theta}}}^{*} ,\) to simulate the hypothetical dynamics of the ESs assuming that the CS density is 0. The results are depicted in Fig. 4, panel (B). We observe a very fast convergence to an equilibrium, compared to the situation where the CS is present. Additionally, the last introduced ES (ES4) cannot establish, and ES3 which was confined in an unfavorable region in the presence of the CS, reaches more favorable regions, leading to a higher proportion. Thus, the competition with the CS alters the outcome of the competition between the ESs, and seems to promote the diversity of the ESs by slowing down the overall dynamics.

Figure 4
figure 4

Estimated average proportions of the classical and emerging strains in the study area. (A) The plain lines correspond to the simulated proportions and the red crosses correspond to the proportions of CS in the data. (B) Simulated proportions of the ESs obtained by assuming that the CS is absent. In both cases, the parameter values correspond to the MLE \(\varvec{\Theta }^{*}\). Note that the curves corresponding to the ESs 1 and 2 are superimposed.

Discussion

In this work, we developed a reaction–diffusion model to describe the spatial dynamics of invasion of a resident population inhabiting a spatially structured environment by newly introduced variants. Using a mechanistic-statistical framework, we confronted the model to a dataset recording the invasion by several emerging genetic variants of a resident population of WMV and succeeded in (1) estimating the dates and places of successful introduction of each emerging variant as well as parameters related to growth and dispersal, (2) reconstructing the invasion by the new variants from their introduction sites, (3) establishing a competitive advantage of the new variants as compared to the resident population and (4) predicting the fate of each variant. Simulations with the optimal parameter values showed an adequate fit, proving that the model is able to reproduce the observed spatial dynamics despite the strong mechanistic constraints of the model structure and the strong spatial censorship of the dataset, i.e. missing data. We used fraction frequencies in the dataset rather than counting data. We argue that such observations are more robust to heterogeneities in the sampling conditions as they do not require a standardized observation protocol. Similarly, frequency data should be more robust to heterogeneities in the susceptibility among host genotypes, provided that all of the strains are influenced in the same manner. As stated by40, abundance data can only be used if the count-proportion, i.e. the ratio between expected count and population size, can safely be assumed to be constant, or if factors affecting variation in the count-proportion can be identified and then accommodated through parametric modeling.

The estimations suggest that three of the four emerging strains have been introduced at approximately the same date, while the fourth one was introduced 5 years later (the model considered only 5-yr intervals of introductions because of constraints on computation time). Despite the neutrality assumption that we made between the emerging strains, we observed different durations of the establishment stage: while ESs 1 and 2 became locally the most prevalent strains only 6 years after their introduction, ES3 displayed delayed dynamics since it became locally the most prevalent strain 11 years after its introduction. In comparison, ES4 was introduced 5 years later than ES3 but became locally the most prevalent strain at about the same date. The low prevalence of the ES3 in the dataset could be explained by a lower fitness of this strain, e.g. higher winter decay rate, lower growth rate or weaker competitivity. Our results indicate that this pattern can also be observed with a neutrality assumption, as a result of the joint effects of the local composition of the landscape, and of the position of the sampling sites, far away from the introduction site. Indeed, in the area where ES4 was first observed, cucurbits crops are very frequent with high connectivity between crops, whereas the area where ES3 was found is patchier. Similarly, our results indicate that the overall prevalence of the CS strain in the study area has been slightly overestimated in the data, due to sampling sites concentrated in the regions where it is indeed the most prevalent strain. The reconstructed dynamics of the five strains therefore underline the importance of estimating jointly the places and dates of the introduction and the other ecological parameters as well as the importance of considering the spatial structure of the sampling design.

Based on WMV-infected samples collected in Southeastern France for more than 30 years, ES1 was first detected in 199938 whereas the other ESs were observed in 2002–200435, i.e. 9–14 years after the estimated introduction dates. Such a lapse between the introduction of a plant pathogen and its first detection is consistent with estimations obtained for other plant viruses21,41,42,43. ESs strains have been detected in several European and Mediterranean countries (16), and in the USA44, in the few years following their description in France, and their prevalence in these countries seems to increase even if few time series data are available. The reasons for these almost simultaneous emergences in distant countries and variable environments are not fully understood. WMV being considered so far as not seed-transmitted, the ESs strains have probably been disseminated through long-distance exchanges of plant material35.

In addition to the dates and places of successful introduction, our model provides estimation of ecological parameters in natura. In particular, the diffusion parameter, measuring the mobility of the viruses (or, more precisely, of their aphid vectors) was estimated, leading to a value \(0.44 {\text{km}}^{2}\) per day for WMV. Among plant viruses, there are few estimates of diffusion coefficients, and most estimate rather focus on the speed of range expansion, which can be more directly derived from observations. For instance, an average speed of 33 km/yr and 13 km/yr was estimated for the leafhopper-transmitted wheat streak virus45 and the whitefly-transmitted East African cassava mosaic virus46 respectively, to be compared with the spread rate of 5 km/year found here. For non-persistently aphid-transmitted viruses like WMV and the other potyviruses, the insect remains viruliferous after probing on an infected plant for only a few minutes to hours16, and the dispersal distance is supposed to be limited, even if in exceptional climatic conditions, the potyvirus maize dwarf mosaic virus was transmitted by viruliferous aphids over more than 1000 km47. Estimations for the potyvirus plum pox virus indicated that 50% of the infectious aphids leaving an infected plant land within about 90 m, while about 10% of flights terminate beyond 1 km48. Here, a diffusion coefficient of 0.44 km2 per day corresponds to a mean dispersal distance of \(\sqrt {\pi D} \approx 1.2\) km after 1 day (see Supplementary Note S7), which seems in agreement with these data.

Another critical parameter that was estimated is the winter decay rate. Most studies focus on the epidemic period but off-season dynamics can be crucial to understand demography and genetic diversity34,49. We assumed here that emerging strains differed from the resident population only through this parameter. Indeed, no obvious differences in host range, including both cultivated hosts and weeds, have been found between CS and ES strains, whereas ESs strains were found to be better transmitted than CS ones from some weeds infected by both CS and ESs50. This could contribute to more efficient transfer from weeds to crops at the beginning of the growing season, leading to a lower winter decay of ESs strains. In our modelling approach we found that ESs strains were able to invade the resident population because of a lower decay during winter. Nevertheless, as all of the other parameters in the model of population dynamics have been set to the same value for the CS and the ESs, all of the competitive advantage of the ESs can be expressed only through the decay rates, explaining these differences, and the unrealistic 0-decay rate of the ESs. The unrealistic null winter decay rate that we found for the ESs suggests that, contrarily to our assumptions, competitive advantage of ESs probably occurs also during summer. Consistently, there is no efficient cross-protection between CS and ESs strains50, but23 found that superinfection by ESs strains of a plant already infected by CS is easier than the opposite situation. In our model, we also make a neutrality assumption between the ESs that differ only by their introduction dates/sites. This assumption may not completely reflect the complexity of the interactions between viral strains and the biological variability between and within molecular groups50. Our model could be extended to consider non-neutral competition between the strains and heterogeneous competitive advantages depending on the host genotype. However, such an approach would bring in a new difficulty in that it would involve a larger number of unknown parameters leading to identifiability issues and underdetermination of the model by the data. As is often the case, there is a trade-off between searching for a more realistic description and models that can be trusted because they still match the data satisfactorily with a small set of parameters to identify.

In 2008, the ESs have replaced the CS, which was not anymore the most prevalent strain, whatever the position in the study area, 18 years after the first ESs introduction. Moreover, this competitive advantage of the ESs is expected to lead to the total replacement of the CS by the ESs in about 25 years, i.e. in 2015. These results are consistent with current knowledge: new observations carried out in 2016 and 2017 showed than the classical CS strain is no more detectable51. Besides the disappearance of CS strains, the surveys performed in 2016–2017 revealed a complex and dynamic situation that fitted partially with the model. ES3 was detected in only one location in the southwestern part of the study area, confirming its low dispersal and probable low fitness. As predicted by the model, ES1 and ES2 were present in the Eastern part of the area and ES4 was present in all the study area. However, it was found to be more prevalent than ES1 and ES2 even in the Eastern part, suggesting that its fitness is higher than ES1 and ES2. New variants, not detected in 2004–2008, were also observed in 2016–2017, and some of them presented a high prevalence in all the area. Deep sequencing of two genomic regions revealed, contrary to the 2004–2008 situation52, a high prevalence of recombinants among these new strains51, blurring the distinction between molecular groups based on CP sequences only. As in 2004–2008, landscape heterogeneity seemed to affect virus dispersal51.

In a more general perspective, this study shows how mechanistic approaches can be used to infer the historical dynamics of invasive genotypes or species from initial introduction. These approaches enable considering hypothetical scenarios, to get a better understanding of the impact of the biological interactions on the overall dynamics. Here, in agreement with theoretical results in53, the simulations without the CS strain showed that its presence promotes a higher diversity among the emerging strains, by altering the outcome of the competition between the ESs, and by slowing down the overall dynamics, thus reducing founder effects. Another advantage of using a mechanistic-statistical approach, compared to a correlative approach, is that the parameter values bring some insight into biological processes and life history traits, (e.g. the diffusion coefficient is related to dispersal ability). Good knowledge of the parameter values, especially for the biological parameters, will be helpful for future modelling, either with reaction–diffusion models or with other approaches such as stochastic diffusion models, which share some common parameters with reaction–diffusion models (e.g.,12). The method developed in this work is computationally very costly. We plan to develop much faster methods, based on deterministic optimization algorithms and analytic descriptions of the gradients of the likelihood. Introducing specific fitness parameters, besides winter decay rate, for the different groups will also help to better understand the effect of interactions between variants on the evolution of viral populations.