1 Introduction

Ribonucleic acid (RNA) viruses such as Foot-and-mouth disease virus (FMDV) and Influenza A (H1N1) have been shown to have high genetic variability (Holland et al. 1982). This variability results in changes to the virus proteins that effect recognition by the host immune system, also known as antigenic differences. Differences in these proteins, also known as antigenic proteins, affect how antigenically similar different viruses are. As a consequence of the antigenic variability in the viruses, vaccines are only effective against field strains that are genetically related and antigenically similar to the vaccine strain (Mattion et al. 2004). This feature of FMDV and Influenza makes it important to estimate antigenic similarity among strains and therefore cross-protection, the protection against one strain conferred by previous exposure to another strain by either infection or vaccination (Paton et al. 2005).

RNA virus are classified into serotypes, genetically and antigenically distinct virus lineages between which there is no effective degree of cross-protection. Individual vaccines may protect against large groups of genetically diverse viruses within a serotype, however there are antigenically distinct subtypes against which the vaccines do not work. In FMDV, South African Territories types 1 and 2 (SAT1 and SAT2) are responsible for the majority of FMDV outbreaks in cattle in the region, while the H1N1 virus has been responsible for several major flu outbreaks; Spanish Flu in 1918 and Swine Flu in 2009. Within these serotypes of FMDV and Influenza are significant levels of antigenic variability, which allows us to examine the relationship between genetic and antigenic variation and to determine which protein changes affect recognition by the immune system. Given the importance of these serotypes in the region and the difficulties with vaccination caused by antigenic variation, it is vital to understand how genetic changes affect antigenicity and within-serotype cross-protection.

In the outer capsid or virus shell, proteins influence antigenicity. Many areas of these proteins are exposed on the surface of the capsid and among these are antigenic regions that are recognised by the host immune system. Single amino acid substitutions (mutations) within these antigenic regions can dramatically affect recognition by the immune system. Identifying the specific amino acid residues that comprise these antigenic regions and the substitutions that cause antigenic differences is critical to understanding antigenic similarity among viruses and cross-protection within serotypes. In the FMDV serotypes, both SAT1 and SAT2 are known to share one major antigenic region, the highly flexible VP1 G-H loop (Crowther et al. 1993b), a cord of connected amino acid residues. Additional residues have also been identified on the SAT serotypes (Crowther et al. 1993b; Grazioli et al. 2006), as well as others on related serotypes A, O, C and Asia1 which may also occur within the SAT serotypes (Grazioli et al. 2013; Kitson et al. 1990; Lea et al. 1994; Saiz et al. 1991). For the H1N1 virus, experimental studies have identified four major antigenic sites (Caton et al. 1982), as well as a number of other sites known to be important (McDonald et al. 2007).

Changes in the antigenic proteins occur as the strains within each serotype evolve. The accumulation of these changes in geographically isolated virus lineages allows for the division of serotypes into topotypes, groups of genetically similar viruses associated with a particular geographic area (Knowles and Samuel 2003). Strains within topotypes share a common evolutionary history that is distinct from strains within other topotypes. Accounting for the genetic differences between topotypes that have arisen due to their significantly different evolutionary paths is necessary for understanding antigenic variability (Reeve et al. 2010). Interpreting the antigenic consequences of genetic differences between topotypes can improve our understanding of the evolutionary history of serotypes, as well as the likely extent of vaccine coverage across topotypes.

In order to infer the antigenic importance of specific genetic changes that have occurred during the evolution of the virus, we require in addition to genetic data, a measure of the antigenic similarity of any two virus strains. Virus Neutralisation (VN) titre and Haemagglutination inhibition (HI) assay give in vitro measures of antigenic similarity between a protective, i.e. a potential vaccine, and a challenge strain, i.e. a potential circulating virus (Hirst 1942; WHO 2011). They approximate the extent to which one strain confers protection against another by recording the maximum dilution at which the virus-specific antibody in a sample of antiserum from a cow (VN titre) or ferret (HI assay) exposed to one strain of the virus (the protective strain) remains able to neutralise a sample of a second virus strain (the challenge strain). Higher titres or assay measures indicate that the antiserum still neutralises the challenge strains at greater dilution and therefore that the protective and challenge strains are more antigenically similar.

In principle, it is possible to identify experimentally how a mutation of the residues affects antigenicity. However, due to the large number of co-circulating virus strains, this is time consuming and expensive. Developing in silico predictors of VN titre and HI assay that are robust and can account for experimental variation can help substantially reduce the number of strains that must be tested in order to select an effective vaccine strain. Previously mixed-effects models have been used on a variety of datasets to model antigenic variability in both FMDV and Influenza by accounting for the experimental variation in the VN titre or HI assay measurements (Reeve et al. 2010, 2016; Harvey et al. 2016).

However in order to identify antigenically important residues, we must infer which explanatory variables are selected in the model. While stepwise regression techniques, such as that of Reeve et al. (2010), can be used to select variables within standard mixed-effects models, they do not explore all variable configurations and can result in a non-optimal solution. An improved method, which allows for simultaneous variable selection, is the Least Absolute Shrinkage and Selection Operator (LASSO) of Tibshirani (1996), which uses an \(\ell _1\) penalty to select the variables. Schelldorfer et al. (2011) have recently extended the LASSO to mixed-effects models with a single random effect and we further extend the method here to work with multiple random effects and the elastic net penalty (Zou and Hastie 2005).

A drawback of the LASSO and elastic net, is the \(\ell _1\) regularisation term itself, equivalent to a Laplace prior in a Bayesian context (Park and Casella 2008). This is computationally efficient and leads to a convex optimisation problem for penalised maximum likelihood or Bayesian maximum a posteriori (MAP) inference. However, \(\ell _1\) regularisation gives an increased bias from shrinkage while not giving sufficient sparsity, as discussed in Chapter 13 of Murphy (2012).

Spike and slab priors, as proposed in Mitchell and Beauchamp (1988), improve variable selection and avoid excessive shrinkage, but lead to a non-convex optimisation problem. The performance improvement of spike and slab priors over \(\ell _1\) regularisation methods has previously been reported (Mohamed et al. 2012). Here these priors are integrated into a Bayesian hierarchical mixed-effects model and this has a number of advantages. In particular Bayesian hierarchical models allow consistent inference of all parameters and hyper-parameters, and inference borrows strength by the systematic sharing and combination of information; see Gelman et al. (2013).

The previously proposed models used datasets containing a variety of explanatory variables in order to explain variation in VN titre and HI assay measurements. The datasets, which we also use here, include variables showing the presence or absence of amino acid substitutions (changes in protein composition caused by genetic mutations) at different residues on the surface of the virus. The selection of these variables within any model then indicates the relevance of that residue in determining the antigenic similarity of the virus strains. Additionally, the datasets include variables to correct for the phylogenetic structure of the serotypes, in order to account for the shared evolutionary history of the strains. All the previous methods have used mixed-effects models in order to correct for the experimental variation associated with the data collection. The random effects include information about the strains tested, the animal from which the serum was taken and when the lab work was completed and by whom.

Using these variables and random effects, Reeve et al. (2010) identified a single known antigenic residue, VP3 138, in the VP1 G-H loop from a relatively small amount of SAT1 data. More recently Reeve et al. (2016) explored an extended SAT1 dataset, which includes an increased number of strains and repeated experiments, and identified a number of known antigenic residues. Reeve et al. (2010) also used their methods on a small SAT2 dataset but were unable to select any significant residues in their model. Using a larger H1N1 dataset Harvey et al. (2016) identified a number of known antigenic site from the four major antigenic regions and other known sites.

The main purpose of this paper is to develop a Sparse hierArchical Bayesian model for detecting Relevant antigenic sites in virus Evolution (SABRE) and use it to analyse the SAT1, SAT2 and H1N1 datasets from Reeve et al. (2010, 2016) and Harvey et al. (2016). We propose and evaluate three different versions of the SABRE method and use them to identify a number of previously unidentified residues, as well as predicting a number of sites that could be plausibly antigenic. Moreover we use the data to investigate the antigenic changes in the evolutionary history of the different lineages within the FMDV serotypes and we are able to make reasonable predictions based on biologically estimated topotypes. In addition to these predictions of antigenic change, we also propose a new method for understanding when non-antigenic changes occur in the evolution of the virus.

2 Classical methods

A variety of classical statistical methods have previously been applied in predicting antigenically significant sites. Before we introduce the SABRE method, we review some of these methods and propose variations which are applicable in this context. For all of the methods we use the following notation; bold upper case letter, \(\mathbf{X }\), for a matrix, bold lower case letter, \(\mathbf{x }\), for a column vector, and non-bold letter, x, for a scalar. We do not distinguish between random variables and their realisations.

For further notational details see Tables 1 and 2 in the online supplementary materials.

2.1 Classical mixed-effects model

We define the response \(\mathbf{y } = (y_1,\dots ,y_N)^\top \) to be the log VN titre or log HI assay and denote the explanatory variables, \(\mathbf{X }\), to be indicators of mutational changes at different residues and information on the phylogenetic structure (see Sect. 4). The explanatory variables, \(\mathbf{X }\), are given as a matrix of \(J+1\) columns and N rows, where the first column is a column full of ones for the intercept. Each column j of explanatory variables, \(\mathbf{x }_j\), is then given an associated regression coefficient, \(w_j\), to control its influence on the response.

We further set the random-effects design matrix, \(\mathbf{Z }\), as the matrix of indicators with N rows and \(||\mathbf{b }||\) columns, where ||.|| indicates the length of the vector. The random-effects design matrix describes experimental conditions which must be accounted for based on the information that is available (see Sect. 4). For our datasets this can include information about the strains tested, the animal from which the serum was taken and when the lab work was completed. The random-effects coefficients are given as \(\mathbf{b } = (\mathbf{b }_{1}^\top ,\ldots ,\mathbf{b }_{G}^\top )^\top \), with each \(\mathbf{b }_g\) relating to a group \(g \in \{1, \ldots , G\}\), e.g. challenge strain. Each \(\mathbf{b }_{g}\) has length \(||\mathbf{b }_g||\) and follows a zero mean Gaussian distribution with a group dependent variance, \(\mathbf{b }_{g} \sim {\mathcal {N}}(\mathbf{b }_{g} | \mathbf{0 }, \sigma _{b,g}^2 \mathbf{I }\)), where \(\mathbf{I }\) is the identity matrix. This leads to the random-effects coefficients having the following joint distribution \(\mathbf{b } \sim {\mathcal {N}}(\mathbf{b } | \mathbf{0 }, \varvec{\varSigma }_{\mathbf{b }})\). Here we define \(\varvec{\varSigma }_{\mathbf{b }}\) to be a matrix with \(\varvec{\sigma }_{\mathbf{b }}^2 = (\sigma ^2_{b,1},\ldots ,\sigma ^2_{b,1},\sigma ^2_{b,2},\ldots ,\varvec{\sigma }^2_{b,G})^\top \) on the diagonal, i.e. \(\varvec{\varSigma }_{\mathbf{b }} = diag(\varvec{\sigma }_{\mathbf{b }}^2)\). In this case each \(\sigma _{b,g}^2\) is repeated with length \(||\mathbf{b }_g||\). See Pinheiro and Bates (2000) for more details on mixed-effects models.

We define the model as:

$$\begin{aligned} \mathbf{y } = \mathbf{X } \mathbf{w } + \mathbf{Z }\mathbf{b } + \varvec{\varepsilon }~ ~ ~ \text{ where } ~ ~ ~ \varvec{\varepsilon }\sim {\mathcal {N}}\left( \varvec{\varepsilon }| \mathbf{0 }, \sigma _{\varepsilon }^2 \mathbf{I }\right) \end{aligned}$$
(1)

assigning the model additive iid Gaussian errors. Using a simple application of Gaussian integrals (Bishop 2006), we integrate over \(\mathbf{b }\) to give the likelihood:

$$\begin{aligned} L\left( \mathbf{w },\sigma _\varepsilon ^2,\varvec{\varSigma }_{\mathbf{b }} | \mathbf{y }, \mathbf{X }, \mathbf{Z }\right) = {\mathcal {N}}\left( \mathbf{y } | \mathbf{X } \mathbf{w }, \mathbf{Z } \varvec{\varSigma }_{\mathbf{b }} \mathbf{Z }^\top + \sigma _\varepsilon ^2\mathbf{I }\right) . \end{aligned}$$
(2)

In classical mixed-effects models, model comparison techniques are often used to choose which variables are included within the model. Reeve et al. (2010) used a mixture of forward inclusion and univariate analysis, making an adjustment for multiple testing using the Holm–Bonferroni correction to ensure a sparse model (Holm 1979). They firstly included terms to account for the evolutionary history of the viruses (see Sect. 4). They then did a univariate test for significance on the residue variables, where a p value of <0.05 corresponded to an antigenically important residue.

2.2 LASSO

A problem with the classical mixed-effects models of Reeve et al. (2010) is their model selection technique, which does not explore all variable configurations and can result in a non-optimal solution. A classical alternative which does allow for simultaneous variable selection is the LASSO of Tibshirani (1996, 2011). The LASSO achieves its variable selection through an \(\ell _1\) penalty (equivalent to a Bayesian Laplace prior). In the simplest case of linear regression, this gives the following parameter estimates:

$$\begin{aligned} \hat{\mathbf{w }} = {\mathop {\mathrm{argmin}}\limits _\mathbf{w }} \left\{ (\mathbf{y }- \mathbf{X }\mathbf{w })^2 + \lambda \sum _{j=1}^J | w_j | \right\} \end{aligned}$$
(3)

where we do not penalise the intercept \(w_0\) so that the model remains scale invariant (Hastie et al. 2009). This is a convex optimisation problem where a variety of fast and effective algorithms exist (e.g. Efron et al. 2004; Hastie et al. 2009). The effect of (3) is to simultaneously shrink and prune parameters \(\mathbf{w }\), thereby promoting a sparse model. The degree of sparsity depends on the regularisation parameter \(\lambda \), which can be optimised with cross-validation or information criteria.

A recent extension of the standard LASSO is the mixed-effects LASSO proposed by Schelldorfer et al. (2011), who marginalised over \(\mathbf{b }\) to estimate the regression coefficients \(\mathbf{w }\), random-effects variances \(\varvec{\sigma }_{\mathbf{b }}^2\) and the variance of the noise \(\sigma _\varepsilon ^2\) as:

$$\begin{aligned}&\left( \hat{\mathbf{w }},\hat{\varvec{\sigma }}_{\mathbf{b }}^2,{\hat{\sigma }}_\varepsilon ^2\right) = \mathop {\mathrm{argmin}}\limits _{\mathbf{w },\varvec{\sigma }_{\mathbf{b }}^2> 0, \sigma _\varepsilon ^2> 0} \left\{ \log |\mathbf{V }| + (\mathbf{y }- \mathbf{X }\mathbf{w })^\top \mathbf{V }^{-1}(\mathbf{y }- \mathbf{X }\mathbf{w }) + \lambda \sum _{j=1}^J | w_j | \right\} \end{aligned}$$
(4)

where \(\mathbf{V } = \mathbf{Z } \varvec{\varSigma }_{\mathbf{b }} \mathbf{Z }^\top + \sigma _\varepsilon ^2 \mathbf{I }\). For our study we choose the penalty parameter \(\lambda \) based on the Bayesian Information Criterion (BIC), as in Schelldorfer et al. (2011).

A problem with the mixed-effects LASSO of Schelldorfer et al. (2011) is that the method has only been developed for one random effect. While it is possible to map the Cartesian product of several random effects onto a single random effect, doing so can lead to over-estimating the complexity of the model. We have therefore developed our own mixed-effects LASSO which is able to handle multiple random effects. Our method uses a conjugate gradient optimisation strategy available in R (R Core Team 2013), but requires a tolerance that must be determined by the user. In practise defining this tolerance is easy to do, as for a large \(\lambda \) and standardised data there will be a group of regressors clearly grouped around zero. The tolerance can then be set such as to force these values to zero, i.e. exclusion from the model, and then \(\lambda \) reduced to create the so called LASSO path. Following the described optimisation strategy may not be as computationally efficient as the purpose-built block coordinate descent scheme proposed in Schelldorfer et al. (2011), but we have found in practise that they achieve the same results.

2.3 Elastic net

A potential improvement over the LASSO is the elastic net of Zou and Hastie (2005). It has several advantages including the ability to select more than N variables in a \(J > N\) situation, whereas the LASSO saturates to at most N variables. More importantly for our application is that it also deals better with groups of correlated variables. While the LASSO will arbitrarily select one of the correlated variables, the penalty of the elastic net allows it to keep all of the variables in the model. See Section 2.3 of Zou and Hastie (2005) for more information on the grouping effect.

The elastic net combines \(\ell _1\) and \(\ell _2\) penalties and in the case of linear regression gives the following parameter estimates:

$$\begin{aligned} \hat{\mathbf{w }} = {\mathop {\mathrm{argmin}}\limits _\mathbf{w }} \left\{ (\mathbf{y }- \mathbf{X }\mathbf{w })^2 + \alpha \lambda \sum _{j=1}^J | w_j | + (1 - \alpha ) \lambda \sum _{j=1}^J | w_j |^2 \right\} \end{aligned}$$
(5)

where \(\lambda \) is the penalty parameter and \(\alpha \) controls the ratio of the \(\ell _1\) and \(\ell _2\) penalties. When \(\alpha =1\) the elastic net is equivalent to the LASSO and likewise ridge regression when \(\alpha =0\). We have investigated values of \(\alpha \) between 0.2 and 0.8 and found the results varied across different datasets and did not show a clear best choice of \(\alpha \); see Section 8 of the online supplementary materials. We have therefore set \(\alpha =0.3\) for the examples in the main paper following Ruyssinck et al. (2014) and relegated the remaining results to the supplementary materials for conciseness; see Sect. 6.1.

Like the LASSO, we can expand the elastic net into the context of a mixed-effects model, something that we propose here:

(6)

where \(\mathbf{V } = \mathbf{Z } \varvec{\varSigma }_{\mathbf{b }} \mathbf{Z }^\top + \sigma _\varepsilon ^2 \mathbf{I }\). We use the same optimisation strategy we proposed for the mixed-effects LASSO.

3 SABRE methods

The LASSO and elastic net have multiple weaknesses, as we have discussed in Sect. 1, and they have been shown to be sub-optimal compared to Bayesian approaches (Dalton and Dougherty 2012; Mohamed et al. 2012) such as the spike and slab prior (Mitchell and Beauchamp 1988). In the remainder of this section we incorporate the spike and slab prior into a hierarchical Bayesian model as shown in the Probabilistic Graphical Model (PGM) in Fig. 1. Figure 1 shows a particular version of the SABRE method, the conjugate SABRE method, but this section also discusses two other versions of the SABRE method; the semi-conjugate SABRE method (Figure 1 in the online supplementary materials) and the binary mask SABRE method (Figure 3 of the online supplementary materials). The parameters of the models are sampled from their posterior distributions using Markov chain Monte Carlo (MCMC).

Fig. 1
figure 1

Compact representation of the conjugate SABRE method as a PGM. The grey circles refer to the data and fixed (higher-order) hyperparameters, while the white circles refer to parameters and hyperparameters that are inferred. PGMs for the semi-conjugate and conjugate binary mask versions of the SABRE method are given in Figures 1 and 3 of the online supplementary materials

3.1 Likelihood

The likelihood for our Bayesian model is similar to the classical mixed-effects model described in Sect. 2.1, however we include only the relevant residue and phylogenetic tree variables, \(\mathbf{X }_{\varvec{\gamma }}\), and regressors, \(\mathbf{w }_{\varvec{\gamma }}\). As with classical mixed-effects models we separate the intercept, \(w_0\), from the other regressors such that it is always included in the model; see Fig. 1. We expect the intercept to be high as each strain should offer strong protection against itself and hence there should be high log VN titre or log HI assay, \(\mathbf{y }\), when all covariates are equal to zero, i.e. the protective and challenge strains are the same.

$$\begin{aligned} p\left( \mathbf{y }|w_0,\mathbf{w }_{\varvec{\gamma }},\mathbf{b },\sigma _\varepsilon ^2,\mathbf{X }_{\varvec{\gamma }},\mathbf{Z }\right) = {\mathcal {N}}\left( \mathbf{y } | \mathbf{1 }w_0 + \mathbf{X }_{\varvec{\gamma }}\mathbf{w }_{\varvec{\gamma }}+ \mathbf{Z }\mathbf{b }, \sigma _\varepsilon ^2 \mathbf{I }\right) . \end{aligned}$$
(7)

The relevance of the jth column of \(\mathbf{X }\) is determined by \(\gamma _j \in \{0,1\}\), where feature j is said to be relevant if \(\gamma _j=1\), giving \(\varvec{\gamma }= (\gamma _1, \ldots , \gamma _J)^\top \in \{0,1\}^J\). We then define \(\mathbf{X }_{\varvec{\gamma }}\) to be the matrix of relevant explanatory variables with \(||\varvec{\gamma }||\) columns and N rows, where \(||\varvec{\gamma }|| = \sum _{j=1}^J \gamma _j\) is the number of non-zero elements of \(\varvec{\gamma }\). Similarly \(\mathbf{w }_{\varvec{\gamma }}\) is given as the column vector of regressors, where the inclusion of each parameter is dependent on \(\varvec{\gamma }\). This is demonstrated by the following example:

$$\begin{aligned} \mathbf{X }&= \begin{bmatrix} x_{1,1}&\quad x_{1,2}&\quad x_{1,3} \\ x_{2,1}&\quad x_{2,2}&\quad x_{2,3} \\ x_{3,1}&\quad x_{3,2}&\quad x_{3,3} \end{bmatrix}; ~ \mathbf{X }_{\varvec{\gamma }}= \begin{bmatrix} x_{1,1}&\quad x_{1,3} \\ x_{2,1}&\quad x_{2,3} \\ x_{3,1}&\quad x_{3,3} \end{bmatrix}; ~ \mathbf{w } = \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix}; ~ \mathbf{w }_{\varvec{\gamma }} = \begin{bmatrix} w_1 \\ w_3 \end{bmatrix};\nonumber \\ \varvec{\gamma }&= \begin{bmatrix} \gamma _1 = 1 \\ \gamma _2 = 0 \\ \gamma _3 = 1 \end{bmatrix}. \end{aligned}$$
(8)

An alternative to this model is the binary mask model; e.g. Chapter 13 of Murphy (2012). In binary mask models the indicator variables, \(\varvec{\gamma }\), ‘mask’ or hide the impact of the non-zero coefficients, \(\mathbf{w }\), and explanatory variables, \(\mathbf{X }\), when the variable is not selected:

$$\begin{aligned} p\left( \mathbf{y }|w_0,\mathbf{w },\varvec{\gamma },\mathbf{b },\sigma _\varepsilon ^2,\mathbf{X },\mathbf{Z }\right) = {\mathcal {N}}\left( \mathbf{y }| \mathbf{1 }w_0 + \mathbf{X } \varvec{\varGamma } \mathbf{w } + \mathbf{Z }\mathbf{b }, \sigma _\varepsilon ^2 \mathbf{I }\right) \end{aligned}$$
(9)

where \(\varvec{\varGamma } = diag(\varvec{\gamma })\). We have tested the binary mask version of the model against the other versions of the SABRE method in Sect. 6.1 and found that the results are reasonably similar. For clarity of the paper we have moved a more in depth description of the model into Section 5 of the online supplementary materials, as we believe that the spike and slab model makes more sense theoretically as the variance of the fixed effects, \(\sigma _{w,h}^2\), is only calculated based on those variables included in the model.

3.2 Noise and intercept priors

As with the classical methods described in Sect. 2, we assume additive iid Gaussian noise with variance \(\sigma _\varepsilon ^2\). In a Bayesian context we wish to infer \(\sigma _\varepsilon ^2\), so we specify the conjugate prior:

$$\begin{aligned} \sigma _{\varepsilon }^2 \sim {{\mathcal {IG}}}\left( \sigma _{\varepsilon }^2 | \alpha _\varepsilon , \beta _\varepsilon \right) \end{aligned}$$
(10)

where the hyper-parameters \(\alpha _\varepsilon \) and \(\beta _\varepsilon \) are fixed, as indicated by the grey nodes in Fig. 1.

Additionally we also require a prior on our intercept, \(w_0\);

$$\begin{aligned} w_0 \sim {\mathcal {N}}\left( w_0 | \mu _{w_0}, \sigma ^2_{w_0} \sigma _{\varepsilon }^2 \right) . \end{aligned}$$
(11)

We treat the intercept differently from the remaining regressors, wishing to use vague prior settings so as not to penalise this term and effectively make the model scale invariant (Hastie et al. 2009).

The distribution of \(w_0\) also has \(\sigma _\varepsilon ^2\) included, which makes the model conjugate rather than semi-conjugate, as discussed in Chapter 3 of Gelman et al. (2013). Additionally, there are relationships between \(w_0\), \(\mathbf{w }_{\varvec{\gamma }}\), \(\varvec{\mu }_{\mathbf{w }} = (\mu _{w,1},\ldots ,\mu _{w,H})^\top \) (defined in Sect. 3.3) and the error variance, \(\sigma _\varepsilon ^2\), increasing information sharing and meaning that the error variance in terms of model fit is reflected in the distribution of the regression coefficients; see Fig. 1. In addition to the increased information sharing, conjugate models also have a computational advantage as the sampling strategy can be improved through using collapsed Gibbs sampling. The difference between the conjugate and semi conjugate SABRE models in terms of accuracy and computational efficiency is discussed in Sect. 6.1. Additionally, the PGM for the semi-conjugate version of our model is given in Figure 1 of the online supplementary materials.

3.3 Spike and slab priors

Spike and slab priors have been used in a number of different contexts and have been shown to outperform \(\ell _1\) methods both in terms of variable selection and out-of-sample predictive performance (Mohamed et al. 2012). They were originally proposed by Mitchell and Beauchamp (1988) as a mixture of a Gaussian distribution and Dirac spike, but have also been used as a mixture of two Gaussians (George and McCulloch 1993, 1997). Binary mask models (e.g. Jow et al. 2014) have also been used as an alternative to the spike and slab prior in a number of applications, as is discussed in Chapter 13 of Murphy (2012). A binary mask based version of the conjugate SABRE method is compared in Sect. 6.1 and given in Section 5 of the online supplementary materials.

The idea behind the spike and slab prior is that the prior reflects whether the feature is relevant based on the values of \(\varvec{\gamma }\). In this way we expect that \(w_{j,h} = 0\) if \(\gamma _j = 0\), i.e. the feature is irrelevant, and conversely it should be non-zero if the variable is relevant, \(w_{j,h} \ne 0\) if \(\gamma _i = 1\). For generality, we allow the models the option to have multiple groups of variables \(h \in \{1,\ldots ,H\}\) which are defined by j, i.e. \(w_{j,h}\) is shorthand for \(w_{j,h_j}\). However this is not used in the results reported in Sect. 6. A conjugate prior, with \(\sigma _\varepsilon ^2\) added for further conjugacy, is then assigned where the feature is relevant and a Dirac spike at zero where it is not:

$$\begin{aligned} p\left( w_{j,h} | \gamma _{j},\mu _{w,h},\sigma _{w,h}^2,\sigma _\varepsilon ^2\right) = \left\{ \begin{array}{l l} \delta _0(w_{j,h}) &{}\quad \text{ if } ~ \gamma _j = 0 \\ {\mathcal {N}}\left( w_{j,h} | \mu _{w,h}, \sigma _{w,h}^2 \sigma _{\varepsilon }^2\right) &{}\quad \text{ if } ~ \gamma _j = 1 \end{array}\right. \end{aligned}$$
(12)

for \(j \in 1,\ldots ,J\) and where \(\delta _0\) is the delta function. Here we have a spike at 0 and as \(\sigma _{w,h}^2 \sigma _{\varepsilon }^2 \rightarrow \infty \) the distribution, \(p(w_{j,h} | \gamma _j=1)\), approaches a uniform distribution, a slab of constant height.

The prior for the variance of the parameters selected is then given by:

$$\begin{aligned} \sigma _{w,h}^2 \sim {{\mathcal {IG}}}\left( \sigma _{w,h}^2 | \alpha _{w,h}, \beta _{w,h}\right) . \end{aligned}$$
(13)

By choosing the same fixed hyper-parameters, \(\alpha _{w,h}\) and \(\beta _{w,h}\) for each h, we lose information coupling between the different groups, although this could be regained with an additional layer in the hierarchical model.

In addition to \(\sigma _{w,h}^2\), we use the hyper-parameters \(\mu _{w,h}\) to reflect the likely non-zero means of each group h:

$$\begin{aligned} \mu _{w,h} \sim {\mathcal {N}}\left( \mu _{w,h} | \mu _{0,h}, \sigma _{0,h}^2 \sigma _{\varepsilon }^2 \right) \end{aligned}$$
(14)

where the hyper-parameters \(\mu _{0,h}\) and \(\sigma _{0,h}^2\) are fixed and \(\sigma _\varepsilon ^2\) is again included in the variance for further conjugacy. This specification comes from our biological understanding of the problem. In the FMDV and H1N1 data we are likely to observe a comparatively large intercept, with negative regression coefficients, \(w_{j,h}\), reflecting the fact that any mutational changes are likely to reduce the similarity between virus strains, therefore reducing the measured VN titre or HI assay.

For mathematical convenience we then define the prior distribution of \(\mathbf{w }_{\varvec{\gamma }}^* = (w_0,\mathbf{w }_{\varvec{\gamma }}^\top )^\top \) as:

$$\begin{aligned} \mathbf{w }_{\varvec{\gamma }}^* \sim {\mathcal {N}}\left( \mathbf{w }_{\varvec{\gamma }}^* | \mathbf{m }_{\varvec{\gamma }}, \sigma _\varepsilon ^2 \varvec{\varSigma }_{\mathbf{w }_{\varvec{\gamma }}^*}\right) \end{aligned}$$
(15)

where \(\mathbf{m }_{\varvec{\gamma }} = (\mu _{w_0},\mu _{w,1},\ldots ,\mu _{w,1},\mu _{w,2},\ldots ,\mu _{w,H})^\top \) and \(\varvec{\varSigma }_{\mathbf{w }_{\varvec{\gamma }}^*} = diag(\varvec{\sigma }_{\mathbf{w }^*}^2)\) with \(\varvec{\sigma }_{\mathbf{w }^*}^2 = (\sigma ^2_{w_0},\sigma ^2_{w,1},\ldots ,\) \(\sigma ^2_{w,1},\sigma ^2_{w,2},\ldots ,\sigma ^2_{w,H})^\top \). Each \(\mu _{w,h}\) and \(\sigma _{w,h}^2\) is repeated with length \(||\mathbf{w }_{\varvec{\gamma },h}||\) dependent on \(\varvec{\gamma }\).

The final part of the spike and slab prior is to set a prior for \(\varvec{\gamma }\), the parameters which determine the relevance of the variables:

$$\begin{aligned} p(\varvec{\gamma }|\pi )&= {\prod _{j=1}^J} {{\mathrm{Bern}}}(\gamma _j|\pi ) \end{aligned}$$
(16)

where \(\pi \) is the probability of the individual variable being relevant. The value of \(\pi \) can either be set as a fixed hyper-parameter as in Sabatti and James (2005), where the authors argue that it should be determined by underlying knowledge of the problem. Alternatively it can be given a conjugate Beta prior:

$$\begin{aligned} \pi \sim {\mathcal {B}}(\pi | \alpha _\pi ,\beta _\pi ) \end{aligned}$$
(17)

as has been used here. This is a more general model, which subsumes a fixed \(\pi \) as a limiting case for \(\alpha _\pi \beta _\pi /((\alpha _\pi + \beta _\pi )^2(\alpha _\pi + \beta _\pi + 1)) \rightarrow 0\) and has also been shown to act as a multiplicity correction in Scott and Berger (2010).

Fig. 2
figure 2

PGMs for the two different specifications of the hierarchical random-effects model. a Classical random-effects model using Gaussian and Inverse-Gamma priors. b Half-t prior specified in a hierarchical manner, as suggested by Gelman (2006)

3.4 Random-effects priors

In mixed-effects models the random effects, \(b_{k,g}\), are usually given group dependant Gaussian priors where the group g is defined by k, i.e. \(b_{k,g}\) is shorthand for \(b_{k,g_k}\):

$$\begin{aligned} b_{k,g} \sim {\mathcal {N}}\left( b_{k,g} | \mu _{b,g},\sigma _{b,g}^2\right) . \end{aligned}$$
(18)

We define this to have a fixed mean, \(\mu _{b,g}=0\), and a common variance parameter, \(\sigma _{b,g}^2\), with a conjugate Inverse-Gamma prior for each random-effects group g, as shown in Fig. 2a:

$$\begin{aligned} \sigma _{b,g}^2 \sim {{\mathcal {IG}}}\left( \sigma _{b,g}^2 | \alpha _{b,g}, \beta _{b,g}\right) \end{aligned}$$
(19)

where \(\alpha _{b,g}\) and \(\beta _{b,g}\) are fixed hyper-parameters for each g and we define \(\mathbf{b } \sim {\mathcal {N}}(\mathbf{b } | \mathbf{0 }, \varvec{\varSigma }_{\mathbf{b }})\) where \(\varvec{\varSigma }_{\mathbf{b }} = diag(\varvec{\sigma }_{\mathbf{b }}^2)\) with \(\varvec{\sigma }_{\mathbf{b }}^2 = (\sigma _{b,1}^2,\ldots ,\sigma _{b,1}^2,\sigma _{b,2}^2,\ldots ,\sigma _{b,G}^2)^\top \) such that each \(\sigma _{b,g}^2\) is repeated with length \(||\mathbf{b }_g||\).

An alternative to this hierarchical prior setting is the folded-non-central-t prior distribution described in Gelman (2006), which gives a redundant multiplicative reparameterisation to the model given in (18), (19) and Fig. 2a. This prior has several potential advantages over the Inverse-Gamma prior. Firstly it is considered to be a prior that better represents non-informativeness. While the posterior distribution can be sensitive to the fixed hyper-parameter settings of an Inverse-Gamma prior, the impact is reduced when the folded-non-central-t prior is used. In that case the posterior distribution does not have a sharp peak at zero unlike with an vague Inverse-Gamma prior, reducing problems with underestimating the variance. Secondly, Gelman (2006) found that the folded-non-central-t prior results in a more realistic posterior distribution of \(\sigma _{b,g}^2\) when there are only a few random effects (usually less than 8) in each group g. The author showed that the posterior distribution reflected the marginal distribution well at its low end, but removed its unrealistically heavy tail; see Figure 2 in Gelman (2006). Doing this ensures that \(\sigma _{b,g}^2\) is not overestimated and does not lead to non-optimal shrinkage of \(\mathbf{b }_g\). Finally the overparameterisation can improve sampling by reducing the dependence between parameters in the hierarchical model leading to improved MCMC convergence (Gelman 2004).

The redundant multiplicative reparameterisation used for this prior specification sets \(\mathbf{b } = \varvec{\eta }\xi \) and is given by the following conjugate priors and shown in Fig. 2b:

$$\begin{aligned} \eta _{k,g}&\sim {\mathcal {N}}\left( \eta _{k,g} | \mu _{\eta ,g}, \sigma ^2_{\eta ,g}\right) \end{aligned}$$
(20)
$$\begin{aligned} \xi&\sim {\mathcal {N}}\left( \xi | \mu _{\xi }, \sigma ^2_{\xi }\right) \end{aligned}$$
(21)

where \(\mu _{\xi }\) and \(\sigma ^2_{\xi }\) are fixed for identifiability, \(\mu _{\eta ,g}=0\), \(\eta _{k,g}\) is shorthand for \(\eta _{k,g_k}\) and each \(b_{k,g} = \xi \eta _{k,g}\). Following Gelman (2006), we fix \(\mu _{\xi } = 0\) which leads to the half-t distribution. We then set a prior on \(\sigma ^2_{\eta ,g}\):

$$\begin{aligned} \sigma ^2_{\eta ,g} \sim {{\mathcal {IG}}}\left( \sigma ^2_{\eta ,g} | \alpha _{\eta ,g}, \beta _{\eta ,g}\right) \end{aligned}$$
(22)

where \(\alpha _{\eta ,g}\) and \(\beta _{\eta ,g}\) are fixed hyper-parameters. In terms of classical mixed-effects models, the variance is given by \(\sigma ^2_{b,g} = \xi ^2 \sigma ^2_{\eta ,g}\). For convenience we define \(\varvec{\eta }\sim {\mathcal {N}}(\varvec{\eta }| \mathbf{0 }, \varvec{\varSigma }_{\varvec{\eta }})\) when \(\mu _{\eta ,g}=0\) for all g and where \(\varvec{\varSigma }_{\varvec{\eta }} = diag(\varvec{\sigma }^2_{\varvec{\eta }})\) with \(\varvec{\sigma }_{\varvec{\eta }}^2 = (\sigma _{\eta ,1}^2,\ldots ,\sigma _{\eta ,1}^2,\sigma _{\eta ,2}^2,\ldots ,\sigma _{\eta ,G}^2)^\top \) where each \(\sigma _{\eta ,g}^2\) is repeated with length \(||\varvec{\eta }_g||\).

3.5 Posterior inference

In order to explore the posterior distribution of the parameters we use an MCMC algorithm. Having chosen conjugate priors where possible means we can run a Gibbs sampler for the majority of parameters (Ripley 1979; Geman and Geman 1984). The only exception is \(\varvec{\gamma }\), although it is possible to use component-wise Gibbs sampling with a small adaptation; see Sect. 3.6.1. Additionally we sample the intercept and regression parameters together and define \(\mathbf{w }_{\varvec{\gamma }}^* = (w_0,\mathbf{w }_{\varvec{\gamma }}^\top )^\top \), \(\mathbf{X }_{\varvec{\gamma }}^* = (\mathbf{1 },\mathbf{X }_{\varvec{\gamma }})\), \(\mathbf{m }_{\varvec{\gamma }} = (\mu _{w_0},\mu _{w,1},\ldots ,\mu _{w,1},\mu _{w,2},\ldots ,\mu _{w,H})^\top \) and \(\varvec{\varSigma }_{\mathbf{w }_{\varvec{\gamma }}^*} = diag(\varvec{\sigma }_{\mathbf{w }^*}^2)\) with \(\varvec{\sigma }_{\mathbf{w }^*}^2 = (\sigma ^2_{w_0},\sigma ^2_{w,1},\) \(\ldots ,\sigma ^2_{w,1},\sigma ^2_{w,2},\ldots ,\sigma ^2_{w,H})^\top \). Each \(\mu _{w,h}\) and \(\sigma _{w,h}^2\) is repeated with length \(||\mathbf{w }_{\varvec{\gamma },h}||\) dependent on \(\varvec{\gamma }\), as indicated below (15).

The conditional distributions for those parameters amenable to standard Gibbs sampling are derived in Section 2 of the online supplementary materials and given here, where by a slight abuse of notation \(\varvec{\theta }'\) denotes all the other parameters, excluding the ones on the left of the conditioning bar:

$$\begin{aligned} \mathbf{w }_{\varvec{\gamma }}^* | \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y }&\sim {\mathcal {N}}\left( \mathbf{w }_{\varvec{\gamma }}^* | \mathbf{V }_{\mathbf{w }_{\varvec{\gamma }}^*} \mathbf{X }_{\varvec{\gamma }}^{*\top } \left( \mathbf{y }- \mathbf{Z }\mathbf{b }\right) + \mathbf{V }_{\mathbf{w }_{\varvec{\gamma }}^*} \varvec{\varSigma }_{\mathbf{w }_{\varvec{\gamma }}^*}^{-1}\mathbf{m }_{\varvec{\gamma }}, \sigma _\varepsilon ^2 \mathbf{V }_{\mathbf{w }_{\varvec{\gamma }}^*}\right) \end{aligned}$$
(23)
$$\begin{aligned} \mathbf{b } | \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y }&\sim {\mathcal {N}}\left( \mathbf{b } | \tfrac{1}{\sigma _\varepsilon ^2}\mathbf{V }_{\mathbf{b }} \mathbf{Z }^\top \left( \mathbf{y }- \mathbf{X }_{\varvec{\gamma }}^* \mathbf{w }_{\varvec{\gamma }}^*\right) , \mathbf{V }_{\mathbf{b }}\right) \end{aligned}$$
(24)
$$\begin{aligned} \sigma _{b,g}^2 | \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y }&\sim {{\mathcal {IG}}}\left( \sigma _{b,g}^2 | ~ ||\mathbf{b }_g||/2 + \alpha _{b,g}, \beta _{b,g} + \tfrac{1}{2} \mathbf{b }_g^\top \mathbf{b }_g\right) \end{aligned}$$
(25)
$$\begin{aligned} \mu _{w,h} | \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y }&\sim {\mathcal {N}} \left( \mu _{w,h} | V_{\mu _\gamma ,h}^{-1}\left( {\scriptstyle \sum }\left( \mathbf{w }_{\varvec{\gamma },h}\right) /\sigma _{w,h}^{2} + \mu _{0,h}/\sigma _{0,h}^2\right) , \sigma _\varepsilon ^2 V_{\mu _\gamma ,h}\right) \end{aligned}$$
(26)
$$\begin{aligned} \sigma _{w,h}^2 | \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y }&\sim {{\mathcal {IG}}}\left( \sigma _{w,h}^2 | ~ ||\mathbf{w }_{\varvec{\gamma },h}||/2 + \alpha _{w,h}, \beta _{w,h}\right. \nonumber \\&\quad \left. +\, \tfrac{1}{2 \sigma _\varepsilon ^2} \left( \mathbf{w }_{\varvec{\gamma },h} - \mathbf{1 }\mu _{w,h}\right) ^\top \left( \mathbf{w }_{\varvec{\gamma },h} - \mathbf{1 }\mu _{w,h}\right) \right) \end{aligned}$$
(27)
$$\begin{aligned} \sigma _{\varepsilon }^2 | \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y }&\sim {{\mathcal {IG}}}\left( \sigma _{\varepsilon }^2 | \left( N + ||\mathbf{w }_{\varvec{\gamma }}^*|| + H\right) /2 + \alpha _\varepsilon , \beta _\varepsilon + \tfrac{1}{2}R_{\sigma _\varepsilon ^2}\right) \end{aligned}$$
(28)
$$\begin{aligned} \pi | \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y }&\sim {\mathcal {B}}\left( \pi | \alpha _\pi + ||\varvec{\gamma }||, \beta _\pi + J - ||\varvec{\gamma }||\right) \end{aligned}$$
(29)

where we sample \(\sigma _{b,g}^2\), \(\mu _{w,h}\) and \(\sigma _{w,h}^2\) for each g and h respectively. We also define \(\mathbf{V }_{\mathbf{w }_{\varvec{\gamma }}^*} = \left( \mathbf{X }_{\varvec{\gamma }}^{*\top } \mathbf{X }_{\varvec{\gamma }}^* + \varvec{\varSigma }_{\mathbf{w }_{\varvec{\gamma }}^*}^{-1}\right) ^{-1}\), \(\mathbf{V }_{\mathbf{b }} = \left( \tfrac{1}{\sigma _\varepsilon ^2 }\mathbf{Z }^\top \mathbf{Z } + \varvec{\varSigma }_{\mathbf{b }}^{-1}\right) ^{-1}\), \(V_{\mu _\gamma ,h} = \left( \left( ||\mathbf{w }_{\varvec{\gamma },h}|| / \sigma _{w,h}^{2}\right) ^{-1} + \left( \sigma _{0,h}^2\right) ^{-1}\right) ^{-1}\) and \(R_{\sigma _\varepsilon ^2} = \left( \mathbf{y }- \mathbf{X }_{\varvec{\gamma }}^* \mathbf{w }_{\varvec{\gamma }}^* -\mathbf{Z }\mathbf{b }\right) ^\top \left( \mathbf{y }- \mathbf{X }_{\varvec{\gamma }}^* \mathbf{w }_{\varvec{\gamma }}^* -\mathbf{Z }\mathbf{b }\right) + \left( \mathbf{w }_{\varvec{\gamma }}^* - \mathbf{m }_{\varvec{\gamma }}\right) ^\top \varvec{\varSigma }_{\mathbf{w }_{\varvec{\gamma }}^*}^{-1} \left( \mathbf{w }_{\varvec{\gamma }}^* - \mathbf{m }_{\varvec{\gamma }}\right) + \sum _{h=1}^H (\mu _{w,h} - \mu _{0,h})^2 / \sigma _{0,h}^2\) for notational simplicity.

In order to use the half-t prior instead of the standard Inverse-Gamma prior we set \(\mathbf{b } = \varvec{\eta }\xi \) and \(\sigma ^2_{b,g} = \xi ^2 \sigma ^2_{\eta ,g}\). This would also need to be done for sampling \(\varvec{\gamma }\) in Sect. 3.6. We can then sample \(\varvec{\eta }\), \(\xi \) and \(\sigma ^2_{\eta ,g}\) from their conditional distributions, replacing (24) and (25):

$$\begin{aligned} \varvec{\eta }| \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y }&\sim {\mathcal {N}}\left( \varvec{\eta }| \tfrac{\xi }{\sigma _\varepsilon ^2} \mathbf{V }_{\varvec{\eta }} \mathbf{Z }^\top \left( \mathbf{y }- \mathbf{X }_{\varvec{\gamma }}^* \mathbf{w }_{\varvec{\gamma }}^*\right) , \mathbf{V }_{\varvec{\eta }}\right) \end{aligned}$$
(30)
$$\begin{aligned} \xi | \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y }&\sim {\mathcal {N}}\left( \xi | V_\xi \left[ \tfrac{\mu _\xi }{\sigma ^2_\xi } + \tfrac{1}{\sigma _\varepsilon ^2} \varvec{\eta }^\top \mathbf{Z }^\top \left( \mathbf{y }- \mathbf{X }_{\varvec{\gamma }}^* \mathbf{w }_{\varvec{\gamma }}^*\right) \right] ,V_\xi \right) \end{aligned}$$
(31)
$$\begin{aligned} \sigma _{\eta ,g}^2 | \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y }&\sim {{\mathcal {IG}}}\left( \sigma _{\eta ,g}^2 | ||\varvec{\eta }_g||/2 + \alpha _{\eta ,g}, \beta _{\eta ,g} + \tfrac{1}{2} \varvec{\eta }_g^\top \varvec{\eta }_g\right) \end{aligned}$$
(32)

where \(\mathbf{V }_{\varvec{\eta }} = (\tfrac{\xi ^2}{\sigma _\varepsilon ^2} \mathbf{Z }^\top \mathbf{Z } + \varvec{\varSigma }_{\varvec{\eta }}^{-1})^{-1}\) and \(V_\xi = (\tfrac{1}{\sigma _\xi ^2} + \tfrac{1}{\sigma _\varepsilon ^2} \varvec{\eta }^\top \mathbf{Z }^\top \mathbf{Z } \varvec{\eta })^{-1}\).

Collapsing will lead to improved mixing and convergence, e.g. Andrieu and Doucet (1999). We take advantage of the induced conjugacy to sample the parameters \(\varvec{\gamma }\), \(\mathbf{w }_{\varvec{\gamma }}^*\), \(\varvec{\mu }_{\mathbf{w }} = (\mu _{w,1},\ldots ,\) \(\mu _{w,H})^\top \), \(\sigma _\varepsilon ^2\) and \(\pi \) as a series of collapsed distributions rather than through Gibbs sampling:

$$\begin{aligned} p\left( \varvec{\gamma }, \mathbf{w }_{\varvec{\gamma }}^*,\varvec{\mu }_{\mathbf{w }},\sigma _\varepsilon ^2,\pi \right)&= p\left( \varvec{\gamma }\right) p\left( \pi | \varvec{\gamma }\right) p\left( \sigma _\varepsilon ^2 |\pi , \varvec{\gamma }\right) p\left( \varvec{\mu }_{\mathbf{w }} | \sigma _\varepsilon ^2, \pi , \varvec{\gamma }\right) \nonumber \\&\quad \times \,p\left( \mathbf{w }_{\varvec{\gamma }}^* | \varvec{\mu }_{\mathbf{w }}, \sigma _\varepsilon ^2, \pi , \varvec{\gamma }\right) \end{aligned}$$
(33)
$$\begin{aligned}&= p\left( \varvec{\gamma }\right) p\left( \pi | \varvec{\gamma }\right) p\left( \sigma _\varepsilon ^2 |\varvec{\gamma }\right) p\left( \varvec{\mu }_{\mathbf{w }} | \sigma _\varepsilon ^2, \varvec{\gamma }\right) p\left( \mathbf{w }_{\varvec{\gamma }}^* | \varvec{\mu }_{\mathbf{w }}, \sigma _\varepsilon ^2,\varvec{\gamma }\right) \end{aligned}$$
(34)

where the conditionality on \(\varvec{\theta }'\), \(\mathbf{X }\), \(\mathbf{Z }\) and \(\mathbf{y }\) has been dropped and the simplification from (33) to (34) follows from the conditional independence relations shown in Fig. 1, exploiting the fact that \(\pi \) is d-separated from the remaining parameters in the argument via \(\varvec{\gamma }\). These distributions are achieved by collapsing over parameters as derived in Sections 2, 3 and 4 of the online supplementary materials, and are used for all of our conjugate models.

3.6 Sampling the latent indicators

Sampling \(\varvec{\gamma }\) is more difficult, as it does not naturally take a distribution of standard form. However we can still get a valid conditional distribution and use a variety of techniques to sample from it. Here we have used collapsing methods following Sabatti and James (2005) to achieve faster mixing and convergence:

$$\begin{aligned} p\left( \varvec{\gamma }| \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y }\right)&\propto \int p\left( \varvec{\gamma },\pi ,\sigma _\varepsilon ^2, \mathbf{w }_{\varvec{\gamma }}^*,\varvec{\mu }_{\mathbf{w }}| \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y }\right) d\varvec{\mu }_{\mathbf{w }} d\mathbf{w }_{\varvec{\gamma }}^* d\pi d\sigma _\varepsilon ^2 \end{aligned}$$
(35)
$$\begin{aligned}&\propto \int p\left( \varvec{\gamma }| \pi \right) p\left( \pi \right) p\left( \mathbf{y }| \mathbf{w }_{\varvec{\gamma }}^*, \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \sigma _\varepsilon ^2\right) p\left( \mathbf{w }_{\varvec{\gamma }}^*|\varvec{\mu }_{\mathbf{w }},\sigma _\varepsilon ^2\right) \nonumber \\&\quad \times \,p\left( \varvec{\mu }_{\mathbf{w }}| \sigma _\varepsilon ^2\right) p\left( \sigma _\varepsilon ^2\right) d\varvec{\mu }_{\mathbf{w }} d\mathbf{w }_{\varvec{\gamma }}^* d\pi d\sigma _\varepsilon ^2 \end{aligned}$$
(36)

where the factorisation follows from the conditional independence relations depicted in Fig. 1 and the fixed hyper-parameters (given as grey circles in Fig. 1) have been dropped to improve notational clarity. The full distribution is available in Section 3 of the online supplementary materials.

Multiple methods have been proposed for sampling the latent variables, \(\varvec{\gamma }\). In this paper we look at two of these in particular; the component-wise Gibbs sampling approach and a Metropolis–Hastings step (Metropolis et al. 1953; Hastings 1970). In the latter we can propose changes to multiple parameters simultaneously for a computational improvement.

3.6.1 Component-wise Gibbs sampling

A component-wise Gibbs sampler can be used to consecutively sample each \(\gamma _j\) from \(\varvec{\gamma }\) in a random order dependent on the current state, c, of all the other \(\gamma \)s, \(\varvec{\gamma }_{-j}^c = \left( \gamma _{1}^c,\ldots ,\gamma _{j-1}^c,\gamma _{j+1}^{c}, \ldots ,\gamma _{J}^{c}\right) \). We can define the conditional distribution of the ith iteration of \(\gamma _j\) to be a Bernoulli distribution with probability \(p(\gamma _j = 1| \varvec{\theta }',\varvec{\gamma }_{-j}^c, \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y }) = \frac{a}{a+b}\), where we define \(a \propto p(\gamma _j=1,\varvec{\gamma }_{-j}^c| \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y })\) and \(b \propto p(\gamma _j=0,\varvec{\gamma }_{-j}^c| \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y })\) using (36).

3.6.2 Block Metropolis–Hastings sampling

Block Metropolis–Hastings sampling can improve mixing and convergence through proposing sets, S, of latent indicator variables, \(\varvec{\gamma }_S\), simultaneously, where \(\varvec{\gamma }_S\) denotes a column vector of all the \(\gamma _j\)s where \(j \in S\) and \(\varvec{\gamma }_{-S}\) its compliment. The proposals are then accepted with the following acceptance rate:

$$\begin{aligned}&\alpha \left( \varvec{\gamma }_S^*,\varvec{\gamma }_S^{c} | \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y }, \varvec{\gamma }_{-S}^c\right) \nonumber \\&\quad := \min \left\{ \frac{q\left( \varvec{\gamma }_S^{c} | \varvec{\gamma }_S^*,\pi \right) p\left( \varvec{\gamma }_S=\varvec{\gamma }_S^*,\varvec{\gamma }_{-S}^c| \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y }\right) }{q\left( \varvec{\gamma }_S^* | \varvec{\gamma }_S^{c},\pi \right) p\left( \varvec{\gamma }_S=\varvec{\gamma }_S^c,\varvec{\gamma }_{-S}^c| \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*, \mathbf{Z }, \mathbf{y }\right) },1 \right\} \end{aligned}$$
(37)

where q(.) is a proposal density, which we set to be: \(q(\varvec{\gamma }_S^* | \varvec{\gamma }_S^{c},\pi ) = \prod _{j \in S} {{\mathrm{Bern}}}(\gamma _j^* | \pi )\). Proposed moves for independent sets of randomly ordered inclusion parameters, \(\varvec{\gamma }_S^*\), are then accepted if \(\alpha (\varvec{\gamma }_S^*,\varvec{\gamma }_S^{c} | \varvec{\theta }', \mathbf{X }_{\varvec{\gamma }}^*,\) \(\mathbf{Z }, \mathbf{y }, \varvec{\gamma }_{-S}^c)\) is greater than a random variable \(u \sim {\mathcal {U}}[0,1]\), until updates have been proposed for all the latent indicator variables.

4 Data

The antigenic similarity between two viruses can be measured by VN titre or HI assay. To get these measurements an antiserum is created from a single strain, the protective, and measured in its ability to neutralise a sample of a different strain, the challenge. However the resulting measures are affected by a number of experimental effects, which can potentially include the challenge strain, antiserum and date; see Sects. 4.2, 4.3, 4.4 and 4.5 . The experimental effects are accounted for via the random effects, with the random-effects coefficients, \(b_{k,g}\), representing an unknown effect of a particular random effect level on the measured log VN titre or log HI assay. Once this has been accounted for, it should then be possible to explain the underlying true log VN titre or log HI assay values by looking at the difference in the protein structure of the two viruses. This difference can be attributed to the presence (1) or absence (0) of an amino acid substitution at each specific residue which is exposed on the surface of the capsid or virus shell. This information can be added into the model as fixed-effects and the selection of a particular residue, or variable, indicates its importance in explaining antigenicity. Residues in the FMDV datasets are given by their protein sequence alignment (Reeve et al. 2010), where for example VP3 138 represents position 138 on the VP3 protein. Residues in the H1N1 dataset are given by their position on the H1 common alignment (Harvey et al. 2016).

However, sometimes we observe antigenic differences between virus lineages that we unable to attribute to amino acid changes at any specific residue. In these cases we wish to relate the changes to the evolutionary history of the virus. We do this through the inclusion of variables related to different properties of the branches of the phylogenetic tree. The reconstruction of phylogenetic trees is not the subject of this article, and therefore we use trees generated from the structural proteins of SAT1, SAT2 and H1N1 viruses presented by Reeve et al. (2010, 2016) and Harvey et al. (2016). Where possible, for each of the branches we include variables related to the effect of the challenge and protective strains, as well as to account for unknown antigenic effects; see Section 6 of the supplementary details for more information.

4.1 Simulated data

To test the effectiveness of the methods described in Sects. 2 and 3 , we have used a simulation study. We simulated 12 sets of simulated data based on the structure of the FMDV datasets. For the first 9 sets of simulated data, the datasets contain 100 measurements for training and 900 measurements for testing. For each of these sets of simulated data we varied the number of variables, \(||\mathbf{w }|| \in \{40,60,80\}\), and the size of the error, \(\sigma _\varepsilon ^2 \in \{0.01,0.1,0.3\}\), to test the methods under different circumstances. The final 3 sets of simulated data were used to test the model under the \(p > N\) setting with each of the datasets containing 50 measurements for training and 900 for testing. We then set \(||\mathbf{w }||=200\) and again varied the size of the error, \(\sigma _\varepsilon ^2 \in \{0.01,0.1,0.3\}\). Additionally in each of the 12 sets of simulated data we added two groups of random effects to each dataset to represent experimental variation, both with 8 levels.

To reflect the fact that we expect many of the variables to have no influence on the response we drew a probability \(\pi \) from \({\mathcal {U}}(0.2,0.4)\) for each of the first 9 sets of data and fixed it to \(\pi =0.05\) for the other 3 sets of data. The range of values for \(\pi \) in the first 9 sets of data reflects the values we expect to see for the FMDV and Influenza datasets, while the value for the other sets of data, \(\pi =0.05\), represents a value found regularly in the literature for other biological problems. With this probability, each of the variables in the dataset was then given a regressor simulated from \({\mathcal {U}}(-0.4,-0.2)\) and zero otherwise, remembering that we expect the variables to have a negative effect as any mutational changes will reduce the response, VN titre. Each response \(y_i\) was then generated with an intercept of 10 and perturbed with \({\mathcal {N}}(0,0.02)\) iid additive Gaussian noise.

4.2 SAT1 data

The original SAT1 dataset analysed in Reeve et al. (2010) is made up of 246 VN titre measurements of comparisons between 3 protective and 20 challenge strains. For each of these measurements, there are 754 residues in the amino acid sequence of the structural proteins. Of these 306 are exposed on the surface of the capsid, and 137 are variable between the 20 test viruses, producing usable indicator variables to assess the antigenic effect of amino acid substitutions. The phylogenetic tree contains 38 branches, and we have included variables on the phylogenetic trees to account for the type of effect where possible (see Section 6 of the supplementary materials), resulting in 64 different indicator variables to help determine the effect of each branch. To complete the analysis we removed groups of variables with correlation coefficients of one, leaving only one variable included but using information from the whole group in order to classify the included variable (see Table 3 and Section 7 of the online supplementary materials). This gave the final dataset 138 variables in total. Random effects were included to account for the antiserum and challenge strain.

4.3 Extended SAT1 data

After the original analysis of the SAT1 data in Reeve et al. (2010), more data was collected, including additional strains and repeated experiments, for the SAT1 serotype (Reeve et al. 2016). This data includes the original SAT1 data and consists of a total of 2125 VN titre measurements with 5 protective and 42 challenge strains. Of the 306 surface exposed sites, the amino acid sequence is variable between the viruses at 146. 132 variables were also provided from the phylogenetic structure. Once groups of variables with correlation coefficients of one were removed, 221 variables were left in the model. Random effects were included to account for the antiserum, challenge strain and date of the experiment.

4.4 SAT2 data

The SAT2 data was originally analysed in Reeve et al. (2010) and contains 320 VN titre measurements of 4 protective and 22 challenge strains. It contains data on 128 variable surface exposed residues and 80 variables associated with different types of phylogenetic changes. After removing variables with correlation coefficients of one as before, this left 148 different variables to be included in the model. Random effects were included to account for the antiserum and challenge strain.

4.5 Influenza A (H1N1) data

Harvey et al. (2016) used a H1N1 dataset that contained 506 challenge strains and 43 protective strains. Here we have uses a slightly smaller dataset in order to fully account for the effect of the phylogenetic structure; see Section 6 of the supplementary materials. The dataset used here contains 15,693 HI assay measurements with 43 challenge and 43 protective strains. As this full dataset is too large to analyse using the conjugate SABRE method we have summarised the data to 570 mean HI assay measurement for each combination of challenge and protective strains. For each pair of challenge and protective strains the 279 explanatory variables, 53 surface exposed residues and 226 variables related to the phylogenetic data, remain the same. Doing this however means we cannot use the date of the experiment as a random effect group and additionally the dataset does not contain antiserum data, meaning we have only used the challenge strain as a random effect group.

5 Computational inference

Our code for the classical mixed-effects models has been implemented in R (R Core Team 2013) using the package lme4 (Bates et al. 2015). To choose these models we used forward inclusion while adjusting for multiple testing using the Holm–Bonferroni correction. The code for the mixed-effects LASSO, mixed-effects elastic net and the SABRE methods was written directly into R and the software is freely available from the authors upon request.

For our MCMC chains, we sampled 10,000 and 15,000 iterations respectively for the simulated and real data before removing an appropriate portion for burn-in. This was determined by running 4 chains for each model and computing the potential scale reduction factor (PSRF) (Gelman and Rubin 1992) from the within-chain and between-chain variances (Plummer et al. 2006). We take a PSRF \(\le 1.05\) as a threshold for convergence and terminate the burn-in when this is consistently satisfied for 95% of the variables. In general, the fixed hyper-parameters, shown as grey nodes in Fig. 1, were set to give a vague distribution for the flexible (hyper-)parameters, shown as white nodes. The only exception was the prior on \(\pi \), defined in (17), which was set to be weakly informative such that \(\alpha _\pi =1\) and \(\beta _\pi =4\). This corresponds to prior knowledge that only a small number of residues or branches have a significant antigenic effect.

The following hyper-parameters, shown as grey nodes in Fig. 1, are fixed to give vague distributions: \(\alpha _{b,g}=\beta _{b,g}=\alpha _{\eta ,g}=\beta _{\eta ,g}=0.001\) and \(\mu _{b,g} = \mu _{\eta ,g} = 0\) for all g, \(\alpha _{w,h}=\beta _{w,h}=0.001\), \(\mu _{0,h}=0\) and \(\sigma ^2_{0,h}=100\) for all h, \(\mu _\xi =0\), \(\sigma ^2_\xi = 100\), \(\mu _{w_0} = max (\mathbf{y })\), \(\sigma ^2_{w_0} = 100\) and \(\alpha _\varepsilon =\beta _\varepsilon =0.001\). The only unusual choice is \(\mu _{w_0} = max (\mathbf{y })\) which follows from us expecting a high intercept with the regression coefficients then having a negative effect on the response. This is a result of strains having high reactivity with themselves, and any changes making the strains less similar, reducing their reactivity.

To analyse the best proposal method we tested the component-wise Gibbs sampler and several specifications of the Metropolis–Hastings sampler on the original SAT1, extended SAT1 and H1N1 datasets. For the Metropolis–Hastings sampler, we proposed the inclusion or exclusion of the variables in groups of 5, 10, 15, 20 and 30. We analysed convergence by monitoring the percentage of variables with a PSRF \(\le 1.05\), similar to Grzegorczyk and Husmeier (2013).

For selecting variables in the mixed-effects LASSO and elastic net we used BIC as in Schelldorfer et al. (2011). For the SABRE method there are a variety of techniques that have been used in the literature to choose a cut-off. Often a cut-off of 0.5 is used and this has been shown to be the best predictive model under strict conditions (Barbieri and Berger 2004). Alternatively the top \(J{\hat{\pi }}\) ranked variables have been taken, where J is the number of variables and \({\hat{\pi }}\) is the posterior mean of \(\pi \), defined in (16) and (17), i.e. the global probability of variables being included in the model.

6 Results and discussion

To recapitulate, we have introduced a hierarchical Bayesian modelling framework (called SABRE) for selecting relevant antigenic sites in viral evolution. There are two fundamentally different approaches to variable selection: the slab and spike prior, whereby the influence of an input variable is controlled via the prior distribution of its associated regression parameters, and the binary mask model, where variables are put through a binary multiplicative filter. There are also different prior distributions one can choose: a conjugate prior, and a semi-conjugate prior. This gives us three variants of the proposed modelling framework:

  • The conjugate SABRE model, with slab and spike prior

  • The semi-conjugate SABRE model, with slab and spike prior

  • The binary mask SABRE model.

These three variants are depicted as probabilistic graphical models in Fig. 1 of the main paper, and in Figures 1 and 3 of the supplementary material. We have compared their performance with that of two established methods from the literature: the mixed-effects model with stepwise variable selection, and the mixed-effects LASSO. Since there are indications from the literature that the elastic net offers an improvement over the LASSO, we have also modified the mixed-effects LASSO model from the literature (Schelldorfer et al. 2011) by a novel mixed-effects elastic net model. This gives us three classical methods for comparison:

  • Mixed-effects model with stepwise variable selection

  • Mixed-effects LASSO model

  • Mixed-effects elastic net model.

We have applied and assessed the proposed methods with a three-pronged approach. Firstly, we have tested them on a large set of synthetic benchmark data, where the true structure of the model is known, and it is therefore straightforward to quantify the accuracy of inference. This is discussed in Sect. 6.1. Secondly, we have applied the methods to real data for which partial biological prior knowledge is known, which can be used to partially assess the model predictions. These findings are presented in Sect. 6.2. Finally, in Sect. 6.3, we present novel applications to new data, from a less well known serotype of FMDV and the H1N1 serotype of the Influenza virus. Here, little reliable biological prior knowledge is available, and the purpose of our study is new hypothesis generation.

As discussed in Sect. 3.4, we have also tested the choice of the random-effects prior on the SAT2 dataset. Figure 3a, b show posterior samples of the log variance of the two random-effects groups from the conjugate SABRE method comparing the half-t and Inverse-Gamma priors, and shows no notable differences. Similarly Fig. 3c shows that the inclusion probabilities for the two competing models are approximately the same. Based on these findings, we only report the results obtained with the conjugate Inverse-Gamma prior.

Fig. 3
figure 3

Gaussian Kernel density estimation plots of random effects variances and a comparison of posterior inclusion probabilities. Gaussian kernel density estimation plots are shown for the sampled posterior densities of the log random effect variance. This is given for the two groups of random effects, a challenge strain and b antiserum, under a vague Inverse-Gamma prior (solid) and the half-t prior (dotted) proposed in Gelman (2006). c Plot showing the comparative posterior inclusion probability for each variable for the two models

6.1 Simulated data with known ground truth

Table 1 compares the different methods in terms of variable selection, Widely Applicable Information Criterion (WAIC) score (Watanabe 2010), predictive performance and fixed effects coefficients inference using the simulated datasets described in Sect. 4.1.Footnote 1 To measure variable selection we have ranked the covariates in terms of their significance or influence. For the Bayesian methods, the ranking is defined by the marginal posterior probabilities of inclusion. For the alternative methods, we explain the way the ranking is obtained below. Since for the simulated data the true covariates are known, this ranking can be used to produce a receiver operating characteristic (ROC) curve (e.g. Hanley and McNeil 1982; Section 5.7. of Murphy 2012), where for all possible values of the inclusion threshold, the sensitivity or recall (the relative proportion of true positive covariates: TP/(TP + FN)) is plotted against the complementary specificity (the relative proportion of false positive covariates: FP/(FP + TN)).Footnote 2 By numerical integration we obtain the area under the ROC curve (AUROC) as a global measure of accuracy, where larger values indicate a better performance, starting from AUROC = 0.5 to indicate random expectation, to AUROC = 1 for perfect variable identification.

Table 1 Table of simulation study results

In addition to ranking the covariates to get ROC curves for the SABRE methods, we also need to rank the alternative established methods for a comparison. For the classical mixed-effects models this is done by removing the significance threshold and ranking the edges by order of inclusion. For the mixed-effects LASSO and elastic net we predicted models for a variety of different penalty parameters, \(\lambda \), to create the so called LASSO path and create a ranking based on when variables become 0. For the mixed-effects elastic net we only show the results for \(\alpha =0.3\) following Ruyssinck et al. (2014), however the remaining results are available in Section 8 of the online supplementary materials. Alternative AUROC values based on using model selection and then ranking the variables based on the absolute values of the regression coefficients (Aderhold et al. 2014), as well as other results, are also available in Section 8 of the online supplementary materials.

Table 1 also measures the accuracy of predicting out of sample observations, \(\mathbf{y }_\mathbf{out }\), and the fixed effects coefficients, \(\mathbf{w }\) in terms of Mean Squared Errors (MSEs). For the Bayesian methods, the predictions are made by sampling from the model and then choosing which variables are included based on taking the top \(J \times {\hat{\pi }}\) variables with the highest inclusion probabilities. The model is then sampled with just those variables set to be included and the estimates calculated. For the mixed-effects LASSO, mixed-effects elastic net and classical mixed effects models the regression coefficients can be taken from the chosen model. The random effects coefficients can then be calculated using the best linear unbiased estimator and predictions of the out of sample observations, \(\mathbf{y }_\mathbf{out }\), made.

In terms of variable selection, the AUROC values shown in Fig. 4 and Table 1 show that all the SABRE methods outperform the alternative methods; the mixed-effects LASSO, the mixed effects elastic net and the classical mixed effects models. This is achieved across all datasets and is highlighted in Fig. 5, which compares the difference in AUROC values obtained by the different methods and that of the conjugate SABRE method. A negative score signifies a reduction in performance compared to the conjugate SABRE method. Figure 5 shows that the conjugate SABRE method performs significantly better than the mixed-effects LASSO, the mixed-effects elastic net and the classical mixed-effects models in all sets of data.

The performance in terms of predicting out of sample observations and inferring fixed effects coefficients shown in Table 1 again shows the SABRE methods outperforming the alternative methods in most cases. Table 1 shows a huge improvement for the SABRE methods in all cases except where both the error variance and number of variables is small. This is especially the case with the mixed-effects LASSO and the mixed-effects elastic net where the reliance on \(\ell _1\) regularisation causes a bias which affects both the inference of the fixed effects coefficients and the variable selection, as well as subsequently the out of sample predictions. The alternative methods do outperform the SABRE methods in some sets of data, however this is limited to a small number of cases which can mainly be explained by the model selection technique used with the SABRE methods not accurately selecting the correct cut-off when the AUROC value is close to 1.

Fig. 4
figure 4

Bar plots of AUROC values from the Simulation Study Results in Table 1. The bar plots give AUROC values for the Conjugate (C), Semi-Conjugate (SC) and Binary Mask Conjugate (BM C) SABRE methods (black bars), the mixed-effects (M-E) LASSO, the mixed-effects elastic net (M-E EN) with \(\alpha =0.3\) (both grey bars) and the classical mixed-effects models (white bars) applied to the simulated data described in Sect. 4.1. a \(N=100\), \(||\mathbf{w }|| = 40\), \(\sigma _\varepsilon ^2 =0.03\). b \(N=100\), \(||\mathbf{w }|| = 40\), \(\sigma _\varepsilon ^2 =0.1\). c \(N=100\), \(||\mathbf{w }|| = 40\), \(\sigma _\varepsilon ^2 =0.3\). d \(N=100\), \(||\mathbf{w }|| = 60\), \(\sigma _\varepsilon ^2 =0.03\). e \(N=100\), \(||\mathbf{w }|| = 60\), \(\sigma _\varepsilon ^2 =0.1\). f \(N=100\), \(||\mathbf{w }|| = 60\), \(\sigma _\varepsilon ^2 =0.3\). g \(N=100\), \(||\mathbf{w }|| = 80\), \(\sigma _\varepsilon ^2 =0.03\). h \(N=100\), \(||\mathbf{w }|| = 80\), \(\sigma _\varepsilon ^2 =0.1\). i \(N=100\), \(||\mathbf{w }|| = 80\), \(\sigma _\varepsilon ^2 =0.3\). j \(N=50\), \(||\mathbf{w }|| = 200\), \(\sigma _\varepsilon ^2 =0.03\). k \(N=50\), \(||\mathbf{w }|| = 200\), \(\sigma _\varepsilon ^2 =0.1\). l \(N=50\), \(||\mathbf{w }|| = 200\), \(\sigma _\varepsilon ^2 =0.3\)

Fig. 5
figure 5

Box plots of the difference in AUROC values for each method in comparison to the conjugate SABRE method. The box plots give the difference in AUROC values for each of the methods after the AUROC value of the conjugate SABRE method has been subtracted for the appropriate dataset. Negative values indicate that the conjugate method has outperformed the alternative method. Each box plot contains 100 datasets as described in Sect. 4.1. The alternative methods are the Semi-Conjugate (SC) and Binary Mask Conjugate (BM C) SABRE methods, the mixed-effects (M-E) LASSO, the mixed-effects elastic net (M-E EN) with \(\alpha =0.3\) and the classical mixed-effects models. a \(N=100\), \(||\mathbf{w }|| = 40\), \(\sigma _\varepsilon ^2 =0.03\). b \(N=100\), \(||\mathbf{w }|| = 40\), \(\sigma _\varepsilon ^2 =0.1\). c \(N=100\), \(||\mathbf{w }|| = 40\), \(\sigma _\varepsilon ^2 =0.3\). d \(N=100\), \(||\mathbf{w }|| = 60\), \(\sigma _\varepsilon ^2 =0.03\). e \(N=100\), \(||\mathbf{w }|| = 60\), \(\sigma _\varepsilon ^2 =0.1\). f \(N=100\), \(||\mathbf{w }|| = 60\), \(\sigma _\varepsilon ^2 =0.3\). g \(N=100\), \(||\mathbf{w }|| = 80\), \(\sigma _\varepsilon ^2 =0.03\). h \(N=100\), \(||\mathbf{w }|| = 80\), \(\sigma _\varepsilon ^2 =0.1\). i \(N=100\), \(||\mathbf{w }|| = 80\), \(\sigma _\varepsilon ^2 =0.3\). j \(N=50\), \(||\mathbf{w }|| = 200\), \(\sigma _\varepsilon ^2 =0.03\). k \(N=50\), \(||\mathbf{w }|| = 200\), \(\sigma _\varepsilon ^2 =0.1\). l \(N=50\), \(||\mathbf{w }|| = 200\), \(\sigma _\varepsilon ^2 =0.3\)

We have also explored multiple different versions of the SABRE method, namely the semi-conjugate (Figure 1 of the online supplementary materials), conjugate (Fig. 1) and binary mask conjugate (Figure 3 of the online supplementary materials) SABRE methods. As far as we are aware the quantitative comparison between a spike and slab based method and a binary mask based one is the first of its kind. Our results given in Table 1, as well as Figs. 4 and 5 , show a strong similarity in performance between the methods. The comparison of AUROC values given in Fig. 5 clearly shows a large overlap in both methods’ variable selection performance and this is backed up by the paired t-tests given in Table 5 of the online supplementary materials. The only exception to this is the AUROC values for the sets of data with \(n=50\) and \(||\mathbf{w }=200||\), which show a small significant improvement for the spike and slab model. This result is not repeated across different datasets so it is logical to conclude that these models perform similarly in general. Identifying that these methods give similar results is important, as in practise both methods are discussed and used throughout the literature, e.g. Murphy (2012), and Jow et al. (2014).

We have also compared the conjugate and semi-conjugate SABRE models, as depicted in Fig. 1 here and Figure 1 in the online supplementary materials. Overall, our results, shown in Table 1 and Tables 4–7 of the supplementary material, suggest that the two methods perform similarly across the wide range of simulated data sets. A paired t-test, summarised in Table 6 of the supplementary material, identifies two data sets (\(||\mathbf{w }|| = 40\), \(\sigma _\varepsilon ^2 =0.3\); \(||\mathbf{w }|| = 60\), \(\sigma _\varepsilon ^2 =0.3\)) where the conjugate SABRE model outperforms the semi-conjugate SABRE model. Formal model selection based on WAIC also shows a slight, but significant preference for the conjugate model in 9 out of 12 sets of data (see Tables 4–7 of the online supplementary material).

The final contribution of our simulation study is to test whether the use of the collapsed sampling scheme in conjunction with increased conjugacy achieves an improvement in terms of MCMC mixing and convergence. Figure 6 indicates that a slight improvement is achieved with the conjugate SABRE model over the semi-conjugate one when the methods were tested on the sets of data with \(n=100\). However, this difference is not statistically significant, as becomes clear when considering the confidence intervals (not shown in Fig. 6 to avoid clutter). This finding suggests that the major bottleneck in the MCMC sampling scheme is caused by the latent variables \(\varvec{\gamma }\) rather than the regression parameters.

Fig. 6
figure 6

Convergence diagnostics for the combined simulated datasets. Convergence diagnostics for the conjugate SABRE method with the collapsed sampling scheme (CSS) (solid line), the semi-conjugate SABRE method without CSS (crosses) and the BM conjugate SABRE method with CSS (circles). The lines show the proportion of parameters converged (PSRF \(< 1.05\)) versus the number of iteration of the 4 MCMC chains. The proportion is based on all of the simulated datasets with \(n=100\) from Sect. 4.1

6.2 Real data with partial ground truth

Both SAT1 datasets have been analysed using classical mixed-effects models. Originally Reeve et al. (2010) analysed the original SAT1 dataset (Sect. 4.2) and Reeve et al. (2016) investigated an extended version of this dataset (Sect. 4.3). We have used our method on each of these datasets in order to identify a number of candidate residues which could be considered important for understanding antigenic variability. Knowledge of which residues are antigenically important is partially incomplete. Therefore, for validation purposes, residues were assigned to three different groups, proven, plausible and implausible, based on how likely they are to be antigenic based on experimental results.

For the SAT1 FMDV serotype, residues are included in the experimentally proven group for three different reasons. Firstly we include any residues which have been experimentally validated as important within the SAT1 serotype by monoclonal antibody escape mutant studies (MAbs) (Grazioli et al. 2006). Secondly, we include those residues which are part of cords of connected experimentally validated antigenic residues for four or more different serotypes; VP1 140–169 (part of the VP1 G-H loop), VP1 200–224 (VP1 C terminus), VP2 70–82 (VP2 B-C loop) and VP3 56–61 (VP3 B-B knob) (Aktas and Samuel 2000; Barnett et al. 1989; Crowther et al. 1993a; Baxt et al. 1989; Bolwell et al. 1989; Grazioli et al. 2006, 2013; Lea et al. 1994; Kitson et al. 1990; Mateu 1995; Saiz et al. 1991; Thomas et al. 1988a, b). As antigenic sites have been found in a large number of different individual locations, we include additional information from other serotypes when classifying whole loops due the similar structure of the different serotypes. Finally, we also include a number of topotype-defining branches that are known to represent significant changes in the evolutionary history (Reeve et al. 2010).

We define the plausible group to consist of residues from any protein loop where residues have been identified in three or less FMDV serotypes, excluding those residues that are already classified as proven. Additionally, any non-topotype-defining branches of the phylogenetic trees are included in the plausible group, as it is unknown which of the remaining branches may also be significant in evolutionary history of the serotype. Finally we classify any residues not included in these groups as implausible.

It is common that variables have correlation coefficients exactly equal to one. In this case we only include one of the variables in the model and use Table 3 of the online supplementary materials to guide the classification into the proven, plausible and implausible groups, as explained in Section 7 of the online supplementary materials.

6.2.1 SAT1 data

The analysis of the original SAT1 dataset has resulted in the identification of 29 residues or branches of importance based on taking the top \(J {\hat{\pi }}\) variables with the highest marginal posterior inclusion probabilities. 9 of the selected residues and 2 of the branches are classified as proven, at the expense of only 1 implausible variable. A full list of selected variables can be found in Section 8.2 of the online supplementary materials. The proportion of the differently classified variables at different cut-off points is shown in Fig. 7a. The proven residues include several that have been validated using MAbs in the SAT1 serotype (Grazioli et al. 2006), as well as others from the VP2 B-C loop, VP1 G-H loop and VP1 C-terminus (the end of the VP1 protein) and we have focused on these proven residues in our analysis.

The residues that have been experimentally validated in the SAT1 serotype are VP3 71 and VP3 77 in the VP3 B-C loop and VP1 144 and VP1 149 in the VP1 G-H loop (Grazioli et al. 2006). Additionally in the VP1 G-H loop, an antigenic loop in every FMDV serotype (Bolwell et al. 1989; Crowther et al. 1993b; Grazioli et al. 2006, 2013; Kitson et al. 1990; Lea et al. 1994) known to distract the host immune systems, the conjugate SABRE method has also identified VP1 143 and VP1 150. These residues are next to the experimentally validated residues in the protein alignment and confirm that the VP1 G-H loop is a highly antigenic part of the SAT1 serotype.

In addition to the residues in the VP3 B-C and VP1 G-H loops, the conjugate SABRE method has additionally selected VP2 74 in the VP2 B-C loop, as well as VP1 216 and VP1 219 in the VP1 C-terminus. The VP2 B-C loop is antigenic in all serotypes and contains the highly antigenic VP2 72 residue, which has been experimentally validated in all of the FMDV serotypes except SAT2 (Aktas and Samuel 2000; Crowther et al. 1993a; Grazioli et al. 2006, 2013; Kitson et al. 1990; Lea et al. 1994; Saiz et al. 1991). The VP1 C-terminus has been proven to be antigenic in all but the Asia1 serotype, although it is almost certainly antigenic there also (Aktas and Samuel 2000; Baxt et al. 1989; Grazioli et al. 2006; Mateu 1995).

Fig. 7
figure 7

Proportion of categorised SAT1 variables included based on different cut-off values for posterior inclusion probability. The graph shows the proportion of the experimentally proven (thick solid line), plausible (solid line) and implausible (dashed line) variables based on a cut-off value for the posterior inclusion probability. The variables were classified into groups based on the method outlined in the first 3 paragraphs of Sect. 6.2. Cut-offs are marked at 0.5 posterior inclusion probability (vertical dashed line) and the posterior inclusion probability equivalent to the top \(J {\hat{\pi }}\) variables with the highest posterior inclusion probabilities (vertical dotted line). a Original SAT1, b extended SAT1

Figure 8a shows the model predictions for the antigenically significant branches based on using just the branch variables from the original SAT1 dataset. Here we have identified all of the branches known to divide topotypes (Reeve et al. 2010), as well as a number of other branches. Several of the branches, including two topotype defining branches, have been specifically identified as reactivity, immunogenic or antigenic changes, an improvement over previously used models.

6.2.2 Extended SAT1 data

The analysis of the extended SAT1 dataset resulted in selecting 76 variables, which included 24 proven residues, 4 important branches in the evolutionary history and only 2 implausible residues. A full list of the selected variables can again be found in the online supplementary materials, Section 8.2, and the proportion of proven, plausible and implausible residues selected at different cut-offs is shown in Fig. 7b here. The improved results over Sect. 6.2.1 show the advantage of getting a larger dataset through testing an increased number of strains under a variety of different experimental conditions.

The conjugate SABRE method has identified 11 residues in the highly variable VP1 G-H loop (VP1 142, VP1 143, VP1 144, VP1 147, VP1 148, VP1 149, VP1 150, VP1 155, VP1 156, VP1 163 and VP1 164). Finding this many significant residues in this highly antigenic region while keeping the number of implausible residues low shows that the model is working effectively.

Additionally, like with the original SAT1 dataset in Sect. 6.2.1, the conjugate SABRE method has selected VP2 74 from the VP2 B-C loop. However in addition it has also selected VP2 72 which is antigenic in all FMDV serotypes and VP2 79 which has been experimentally validated in the A, O, Asia1 and SAT2 serotypes (Grazioli et al. 2006, 2013; Mateu 1995). The conjugate SABRE model also again selects several residues from the VP1 C-terminus; VP1 209, VP1 211 and VP1 218.

The final proven residues are from the VP3 B-B knob or have been experimentally validated specifically in the SAT1 serotype (Grazioli et al. 2006). In the VP3 B-B knob the conjugate SABRE method has identified VP3 58 (serotypes A, O, C and Asia1) and VP3 61 (serotype A) (Grazioli et al. 2006; Lea et al. 1994; Mateu 1995). From those residues which have specifically been validated in the SAT1 serotype, again VP3 71 and VP3 77 from the VP3 B-C loop have been selected. However for the extended SAT1 dataset, the conjugate SABRE method has also selected VP3 138, which was also found in Reeve et al. (2010), from VP3 E-F loop.

Fig. 8
figure 8

Phylogenetic trees indicating significant branches in the evolutionary history of the SAT1 serotype. Phylogenetic trees were created using BEAST v1.7.2 and FigTree v1.4.2 from aligned nucleotide sequence data with date of isolation. Marked on the tree are protective strains (asterisk) and topotype defining branches (dashed vertical line). Branches inferred by the SABRE method are highlighted (black). Symbols indicate whether this was inferred to be a change in virus antigenicity (dagger), virus reactivity (double dagger) or virus immunogenicity (section mark). Where a highlighted branch has no symbol, an associated change in antigenicity or reactivity could not be discriminated between. The cut-off for significance is discussed in Sects. 6.2.1 and 6.2.2. a Original SAT1, b extended SAT1

As well as finding some branches in our overall model (including 4 topotype defining branches identified as representing significant evolutionary changes a priori), we have also compiled a model based only on branches to help us understand the evolutionary history of the serotype. The results of this model are given in Fig. 8b, where the seven branches known to define topotypes are indicated by the vertical line. In order to produce more interpretable results, where larger groups of strains are not separated by a significant evolutionary change (selected branch), we have used a cut-off of 0.5. The full results using a \(J {\hat{\pi }}\) cut-off are given in Figure 5 of the online supplementary materials. The results given in Fig. 8b show that we have been able to identify all but one of the topotype defining branches, while the other is found when the \(J {\hat{\pi }}\) cut-off is used. We have also been able to specify whether the evolutionary changes have affected virus antigenicity, immunogenicity or reactivity, helping us to further understand the underlying biological processes.

6.2.3 Comparison with previous work

To compare the results of the SABRE method against the mixed-effects models used in Reeve et al. (2010, 2016) and Harvey et al. (2016), we examine which categories (proven, plausible or implausible) the various residues selected fall into. Note that to do this we ignore any branch terms that do not directly correspond to a residue term. The full results for variables selected can be found in Tables 6, 8,10 and 13 of Section 8 in the online supplementary materials. For comparison, the results of Reeve et al. (2016) are given in Table 12 of the online supplementary materials, as the results of the equivalent study are not given in the original paper. We also note that the results of Harvey et al. (2016) are not completely comparable as they were obtained from a larger dataset.

For the original SAT1 dataset, Reeve et al. (2010) selected 0 proven, 0 plausible and 0 implausible residues using the method described in Sect. 2 (i.e. when the Holm–Bonferroni correction was used). These results compare to 1 proven, 1 plausible and 0 implausible residues when the conjugate SABRE method was used and selecting any residue variables with a marginal posterior inclusion probability of greater than or equal to 0.5.Footnote 3 We have also looked at how well the methods do before selecting an implausible variable or before a p value of greater than 0.05 (before the Holm–Bonferroni correction was used) was reached (in Reeve et al. (2010) the variable selection process was stopped as soon as a 0.05 p value was reached). In this situation again the conjugate SABRE method offers an improvement, selecting 5 proven, 5 plausible and 0 implausible residues compared to 1, 1 and 0 respectively for the classical mixed-effects models. The difference in these results shows an advantage for the conjugate SABRE method over the classical mixed-effects models.

In the extended SAT1 dataset, Reeve et al. (2016) used the method in Sect. 2 to select 5 proven, 0 plausible and 0 implausible residues, or 8, 1 and 0, respectively, if the method continued until selecting the first implausible residue. The conjugate SABRE method selected 11 proven, 3 plausible and 0 implausible residues when taking any variables with marginal posterior inclusion probabilities of greater than or equal to 0.5, or 15, 4 and 0, respectively, before selecting the first implausible residue.Footnote 4 It can again be seen that the power of the proposed SABRE method has improved over the method of Reeve et al. (2010).

Fig. 9
figure 9

Convergence diagnostics for the original SAT1 and extended SAT1 datasets. The lines show the proportion of parameters that have converged (PSRF \(< 1.05\)) versus the average CPU time (second) when using component-wise Gibbs sampling (crosses) and Metropolis–Hastings sampling proposing 5 (solid), 10 (dashed), 15 (dotted), 20 (thick solid) and 30 (thick dotted) inclusion parameters simultaneously. a Original SAT1, b extended SAT1

6.2.4 Sampling of latent indicators

Figure 9 compares component-wise Gibbs sampling (Sect. 3.6.1) against block Metropolis–Hastings sampling (Sect. 3.6.2) in terms of speed of convergence. To do this we ran 4 chains for the component-wise Gibbs sampler and each of the 5 variations of the Metropolis–Hastings sampler, monitoring the PSRFs for each parameter in the different methods. Figure 9 shows the proportion of parameters with PSRFs \(< 1.05\) in each case compared with the CPU time taken to get that number of samples. The higher the proportion of parameters with PSRFs \(< 1.05\), the better the method is said to have performed.

The results from Fig. 9 support the advantage of a block Metropolis–Hastings sampler over a component-wise Gibbs sampler. In all of the datasets the block Metropolis–Hastings samplers have outperformed the component-wise Gibbs sampler, with the exception of when more than 40 or 50 variables were sampled at a time (not shown in the diagrams for clarity). This shows that even sampling a reasonably large number of variables simultaneously, where the acceptance rate is likely to be low, can still yield a notable improvement. The resultsFootnote 5 in Fig. 9 suggest that as a rule of thumb, sampling about 10 or 15 (or around 7%) of the variables at a time will lead to effective sampling with the quickest convergence.

6.3 Real data for novel predictions

Little knowledge is available on how mutational changes affect antigenic variability for the SAT2 FMDV serotype and the H1N1 Influenza virus. We have therefore applied our conjugate SABRE method as a tool for new hypothesis generation.

6.3.1 SAT2 data

Although knowledge of the SAT2 FMDV serotype is minimal, for validation purposes we can exploit knowledge gained from other serotypes of FMDV and previous work on the SAT2 serotype. Grazioli et al. (2006) and Crowther et al. (1993b) has found evidence for antigenicity of the following three areas of the SAT2 capsid: VP1 140–169 (part of the VP1 G-H loop), VP1 200–224 (VP1 C terminus) and VP2 70–82 (VP2 B-C loop). A full list of the variables selected by the conjugate SABRE method can be found in the online supplementary materials, Section 8.2.

Firstly in the VP2 B-C loop, the conjugate SABRE method has identified 5 residues that are antigenic; VP2 71, VP2 72, VP2 78, VP2 79 and VP2 80 (Grazioli et al. 2006, 2013; Kitson et al. 1990; Lea et al. 1994; Saiz et al. 1991). Of these VP2 78 has been experimentally identified using MAbs (Grazioli et al. 2006). Additionally VP2 72 is known to be antigenic in all other serotypes and these results suggest it is also antigenic in the SAT2 serotype (Grazioli et al. 2006, 2013; Mateu 1995).

The second region in which antigenically significant residues have been found is in the VP1 G-H loop. The VP1 G-H loop is known to be a highly variable distracter site designed to confuse the host immune system (Crowther et al. 1993b) and is antigenic in all of the FMDV serotypes. In this loop, the conjugate SABRE method has specifically identified VP1 144 and VP1 166, where it is notable that VP1 166 lies directly between several residues that have been experimentally validated in the SAT2 serotype using MAbs (Crowther et al. 1993b).

Fig. 10
figure 10

Phylogenetic trees indicating significant branches in the evolutionary history of the SAT2 serotype. The phylogenetic tree was created using BEAST v1.7.2 and FigTree v1.4.2 from aligned nucleotide sequence data with date of isolation. Marked on the tree are protective strains (asterisk). Branches associated with a change in virus phenotype are highlighted (black). Symbols indicate whether this was inferred to be a change in virus antigenicity (dagger), virus reactivity (none-identified) or virus immunogenicity (section mark). Where a highlighted branch has no symbol, an associated change in antigenicity or reactivity could not be discriminated between. The cut-off for significance was taken to be the \(J {\hat{\pi }}\) variables with the highest marginal inclusion probability

The final known antigenic region that has been identified by the conjugate SABRE method is part of the VP1 C-terminus, the end of the VP1 protein. In the VP1 C-terminus we have identified VP1 207, VP1 208, VP1 209, VP1 210 and VP1 211 which are part of a region known to be antigenic in all FMDV serotypes except Asia1 (Aktas and Samuel 2000; Grazioli et al. 2006; Lea et al. 1994; Saiz et al. 1991). With the conjugate SABRE method identifying all these neighbouring residues, it suggests that this section of the protein is a highly antigenic part of the SAT2 serotype.

Figure 10 gives the phylogenetic tree for the SAT2 serotype with the predicted significant evolutionary changes. Unlike the SAT1 serotype, there is no prior knowledge of which residues and branches are antigenically relevant and we therefore apply our method to generate genuinely new hypotheses. The results presented give our best prediction for the significant branches and show a couple of potentially interesting groupings which could represent functional groups for the SAT2 serotype.

6.3.2 Influenza A (H1N1) data

Knowledge of the H1N1 Infuenza serotype is restricted to a few experimental results and some knowledge about where antigenic sites are likely to occur. In general the virus can be divided into the head and stalk domains, with the head domain most likely to contain antigenic residues within four major antigenic sites (Sa, Sb, Ca and Cb) (Caton et al. 1982). But knowledge is incomplete as to which areas of the head domain contain antigenic residues beyond that. We have applied the conjugate SABRE method to the H1N1 dataset and a full list of the variables selected can be found in the online supplementary materials, Section 8.3.

Of those variables selected by the conjugate SABRE method, one residue was identified on the Receptor Binding Site (position 187 on the H1 common alignment), the main binding site for the H1N1 virus (Skehel and Wiley 2000). Additionally we have identified 4 other residues that are nearby (positions 130, 153, 189 and 190), including one on the Sa antigenic site, two on the Sb and one that is known to be a location of a major antigenic change (130) (Harvey et al. 2016). 6 other residues are identified on the Ca and Cb antigenic site (positions 69, 72 and 74 on the Cb; positions 139, 141 and 142 on the Ca). Other residues were also found near to the Ca and Cb antigenic sites, but with a lack of experimental knowledge about the H1N1 virus it is impossible to tell whether they are antigenic sites without experimental validation.

7 Conclusion

We have addressed the problem of identifying the residues within the SAT1 and SAT2 serotypes of FMDV and Influenza A (H1N1) that are responsible for changes in antigenic variability. This allows us to identify which residues must remain the same in order for two strains to cross react and for one strain to potentially be used as an effective vaccine against another. Identifying such residues can reduce the number of strains that must be tested as a vaccine, potentially reducing the time and cost associated with the selection procedure.

We have proposed a sparse hierarchical Bayesian model for detecting relevant antigenic sites in virus evolution (SABRE) and shown how it offers improvement over the classical mixed-effects model, the mixed-effects LASSO and the mixed-effects elastic net. There are four reasons for this improvement. The proposed hierarchical modelling framework with slab-and-spike prior (1) avoids the bias inherent in Lasso-type methods, (2) genuinely and consistently achieve sparsity, (3) properly accounts for uncertainty at all levels of inference, and (4) borrows strength from information coupling, whereby all parameters are systematically and iteratively inferred in the context of all other parameters. In some more detail: (1) The shrinkage effect inherent in the \(\ell _1\) penalty term introduces a bias by which the regression parameters are systematically underestimated. This bias is avoided with the slab and spike prior that we use. (2) The LASSO is known to only give sparse solutions at the MAP (maximum a posteriori) configuration, but not when sampling parameters from the posterior distribution. From a Bayesian perspective, the MAP is methodologically inconsistent, as it is not guaranteed to represent the region in parameter space with the highest probability mass. The spike-and-slab prior, which we use, avoids this methodological inconsistency and achieves sparsity in a sound Bayesian inference context. (3) In our hierarchical Bayesian models, all sources of uncertainty are properly accounted for. The higher-level hyperparameters have their own distributions, which are systematically inferred from the data. In contrast, the regularisation parameters of the established methods are typically fixed, set e.g. by cross-validation, but without taking their uncertainty into account (see also Chapter 5 in Gelman et al. (2013) for a more detailed discussion). (4) In our approach, we explicitly model all dependencies among the variables, and inference is carried out within the context of the whole system. This systematically borrows strength from information coupling and avoids the piecemeal approach of established methods.

There are two fundamentally different approaches to variable selection in Bayesian hierarchical models: the slab-and-spike prior, whereby the influence of an input variable is controlled via the prior distribution of its associated regression parameters, and the binary mask model, where variables are put through a binary multiplicative filter. The difference is depicted in Fig. 1 here and Figure 3 in the online supplementary materials. Which method is better? Standard textbooks, like Murphy (2012), describe both methods (see Chapter 13), but do not offer a comparative evaluation, and in the literature, authors rather arbitrarily tend to opt for one method or another (see e.g. Heydari et al. (2016)). We have carried out a systematic comparison to properly quantify the difference in terms of accuracy and computational efficiency, and found it to be negligible. We have also systematically evaluated the influence of the prior, comparing a conjugate with a non-conjugate prior, as depicted by Fig. 1 here and Figure 1 in the online supplementary materials, and we have assessed its influence systematically in terms of accuracy, computational efficiency, and formal model selection preference in Table 1. The differences in accuracy are negligible (see e.g. Fig. 5). The conjugate model has slightly better computational efficiency (Fig. 6), but this difference is not significant; this finding indicates that the bottleneck in the computational procedure is the sampling of the latent variables rather than the regression parameters. The conjugate model shows a slight but significant improvement over the non-conjugate model in a number of the model selection scores based on WAIC, as seen from Table 1 here and Table 7 of the online supplementary materials, but this has little immediate impact on the variable selection. Overall, our findings demonstrate a remarkable robustness of the proposed hierarchical modelling framework with respect to minor model modifications, which boosts our confidence in the predictions and in the variable ranking.

Further to this we have investigated the sampling of latent inclusion variables. We have shown that by proposing multiple variables simultaneously through Metropolis–Hastings sampling it is possible to give a significant computational improvement over the conventional component-wise Gibbs sampler (Fig. 9). We have shown this improvement in a number of different datasets and have offered a general rule of thumb that proposing 10 or 15 (or around 7%) of the variables at a time will lead to good mixing within MCMC chains for a variety of different datasets.

Through the use of this new model with the improved sampling techniques we have been able to identify an increased number of known antigenic sites in the SAT1 serotype of FMDV (Grazioli et al. 2006) compared to Reeve et al. (2010) and Reeve et al. (2016), while incurring no (for the default selection threshold 0.5) or only a very small number (for the inferred selection threshold \(J {\hat{\pi }}\)) implausible residues. Very little biological knowledge exists about the SAT2 serotype, and a previous in silico application has failed to make any predictions at all (Reeve et al. 2010). To our knowledge, our study is the first time that specific new hypotheses about genetic-antigenic associations have been made with an in silico model based on the currently available data. Additionally we have provided an insight into the evolutionary history of the SAT serotypes (Figs. 8 and 10 ) and have provided a novel way of interpreting the biological effects of these virus mutations. Finally we have identified a number of significant antigenic sites in the H1N1 Influenza virus and provided new hypotheses for this virus.

7.1 Future work

Further work to follow on from this paper comes in several forms. Firstly we would like to find a way of effectively using the proposed model on the full H1N1 dataset used in Harvey et al. (2016), without having to summarise and reduce the data as we have done here. With 15,693 measurements, currently the model is too computationally expensive to get a reasonable number of posterior samples. More computationally efficient methods, such as variational inference or expectation propagation (Minka 2001), can offer a solution to this problem at the expense of an approximation. Methods for using spike and slab priors have been proposed for both of these techniques, e.g. Titsias and Lázaro-Gredilla (2011) and Hernández-Lobato et al. (2013), and these can be extended to create alternative, faster versions of the SABRE methods.

Secondly we would like to investigate how different types of mutations affect antigenic variability. At present the data simply consists of indicators of mutational changes that occur without any regard to the type of mutation, something which is addressed in Reeve et al. (2016). Adding this information will dramatically increase the number of variables on which selection must be made and is likely to make inferring \(\varvec{\gamma }\) more difficult. To address this we will likely require improved proposal distributions for \(\varvec{\gamma }\), as has been used for continuous variables in Haario et al. (2006), which account for the estimated posterior correlations between these latent inclusion parameters. A method to generate correlated binary variables has been proposed in Leisch et al. (1988). However we would require a density function to put into the Metropolis–Hastings ratio in order to use this within MCMC sampling.

With more information, and therefore more variables, relating to the mutations being included in the models, it may be necessary to add additional information sharing between the latent indicator variables, \(\varvec{\gamma }\). Latent Gaussian processes can be used to model this, where inference can be achieved in a variety of ways, e.g. Filippone et al. (2013) and Andersen et al. (2014). The use of latent Gaussian processes would allow us to introduce correlations between mutations of the same type or mutations occurring in similar location on the surface of the virus shell. This can potentially allow us to identify which types of mutations are important, as well as identifying complete antigenic regions rather than just individual residues.