Dealing with sources of variability in the data-analysis of phenotyping experiments with transgenic rice

Euphytica

Abstract

The analysis of phenotyping experiments for transgenics deserves special attention. Experiments set up for the detection of interesting phenotypes among transgenic plants have to screen several primary events obtained by transforming with a particular transgene, since expression levels of the transgene differ considerably. The agronomically most interesting lines might have an intermediate level of transgene expression. Therefore, attention should be paid to all transformants and how their expression levels differ. Experimental design and the analysis of the data have to focus on the variability among lines and have to be able to detect small differences in quantitative traits. The mixed model is the most adequate approach to analyse data of phenotyping experiments because it reflects the structure of the data and provides the researcher with important measures to allow broader inferences. The paper explains the model and illustrates it using a screening experiment carried out by the high-throughput phenotyping method of TraitMill™. Besides inference for a particular experiment and a particular set of lines, the output allows more general predictions for a wider set of non-tested lines. It quantifies the various sources of variability encountered and helps to understand the underlying process. It also helps to optimise the experimental set-up of future experiments. The model presented here has been implemented in the R language and in SAS. The scripts are attached.

References

  • Ayres MP, Thomas DL (1990) Alternative formulation of the mixed-model anova applied to quantitative genetics. Evolution 44:221–226

  • Bhat SR, Srinivasan S (2002) Molecular and genetic analyses of transgenic plants: considerations and approaches. Plant Sci 163:673–681

  • Bhattacharyya MK, Stermer BA, Dixon RA (1994) Reduced variation in transgene expression from a binary vector with selectable markers at the right and left T-DNA borders. Plant J 6:957–968

  • Bishop J, Venables WN, Wang Y-G (2004) Analysing commercial catch and effort data from Penaeid trawl fishery. A comparison of linear models, mixed models, and generalised estimating equation approaches. Fish Res 70:179–193

  • Butaye K (2004) Analysis of transgene expression variation in Arabidopsis thaliana. Dissertation, Katholieke Universiteit Leuven

  • Carroll RJ (2003) Variances are not always nuisance parameters. Biometrics 59:211–220

  • Cellini F, Chesson A, Colquhoun I et al (2004) Unintended effects and their detection in genetically modified crops. Food Chem Toxicol 42:1089–1125

  • Cooper M, Hammer GL (1996) Plant adaptation and crop improvement. CAB International, Wallingford, UK

  • Fox JC, Ades PK, Bi H (2001) Stochastic structure and individual-tree growth models. For Ecol Manage 154:261–276

  • Fry JD (1992) The mixed-model analysis of variance applied to quantitative genetics: biological meaning of the parameters. Evolution 46:540–550

  • Gleeson AC, Cullis BR (1987) Residual maximum likelihood (REML) estimation of a neighbour model for field experiments. Biometrics 43:227–288

  • Green PJ (1985) Linear models for field trials, smoothing and cross-validation. Biometrika 72:527–537

  • Greenland S (2000) Principles of multilevel modelling. Int J Epidemiol 29:158–167

  • Henderson CR (1953) Estimation of variance and covariance components. Biometrics 9:226–252

  • Hennessy DA, Miranowski JA, Babcock BA (2004) Genetic information in agricultural productivity and product development. Am J Agric Econ 86:73–87

  • Jones JDG, Dunsmuir P, Bedbrook J (1985) High level expression of introduced chimaeric genes in regenerated transformed plants. EMBO J 4:2411–2418

  • Kumar S, Fladung M (2001) Gene stability in transgenic aspen (Populus). II. Molecular characterization of variable expression of transgene in wild and hybrid aspen. Planta 213:731–740

  • Larkin PJ, Scowcroft WR (1981) Somaclonal variation – a novel source of variability from cell cultures for plant improvement. Theor Appl Genet 60:197–214

  • Littell RC, Milliken GA, Stroup WW et al (1996) SAS system for mixed models. SAS Institute Inc., Cary, NC

  • Manor O, Zucker DM (2004) Small sample inference from the fixed effects in the mixed linear models. Comput Stat Data Anal 46:801–817

  • Meuwissen THE, Goddard ME (1997) Estimation of effects of quantitative trait loci in large complex pedigrees. Genetics 146:409–416

  • Meyer P (1995) Understanding and controlling transgene expression. Trends Biotechnol 13:332–337

  • Mlynárová L, Keizer LC, Stiekema WJ et al (1996) Approaching the lower limits of transgene variability. Plant Cell 8:1589–1599

  • Oleksiak MF, Churchill GA, Crawford DL (2002) Variation in gene expression within and among natural populations. Nat Genet 32:261–266

  • Paterson S, Lello J (2003) Mixed models: getting the best use of parasitological data. Trends Parasitol 19:370–375

  • Patterson HD, Thompson R (1971) Recovery of interblock information when block sizes are unequal. Biometrika 58:545–554

  • Piepho HP, Pillen K (2004) Mixed modelling for QTL × environment interaction analysis. Euphytica 137:147–153

  • Piepho HP, Büchse A, Emrich K (2003) A hitchhiker’s guide to mixed models for randomized experiments. J Agron Crop Sci 189:310–322

  • Pinheiro JC, Bates DM (2000) Mixed-effects models in S and S-PLUS. Springer, NY

  • Reuzeau C, Frankard V, Hatzfeld Y et al (2005) TraitMill™: a functional genomics platform for the phenotypic analysis of cereals. Plant Genet Resour 4:1–5

  • Reyes JC, Hennig L, Gruissem W (2002) Chromatin-remodeling and memory factors. New regulators of plant development. Plant Physiol 130:1090–1101

  • Stam M, Mol JNM, Kooter JM (1997) The silence of genes in transgenic plants. Ann Bot 79:3–12

  • Vain P, James VA, Worland B et al (2002) Transgene behavior across two generations in a large random population of transgenic rice plants produced by particle bombardment. Theor Appl Genet 105:878–889

  • Wang DL, Zhu J, Li ZK et al (1999) Mapping QTLs with epistatic effects and QTL × environment interactions by mixed model approaches. Theor Appl Genet 99:1255–1264

  • Welham SJ, Thompson R (1997) Likelihood ratio tests for fixed model terms using residual maximum likelihood. J Royal Stat Soc 59:701–714

  • Welham S, Thompson R (2000) REML analysis of mixed models. In: The Guide to Genstat. Part 2: Statistics, pp 413–500

  • Welham S, Cullis B, Gogel B et al (2004) Prediction in linear mixed models. Aus NZ J Stat 46:325–347

  • Wernisch L, Kendall SL, Soneji S et al (2003) Analysis of whole-genome microarray replicates using mixed models. Bioinformatics 19:53–61

  • Wolfinger RD, Gibson G, Wolfinger ED et al (2001) Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol 8:625–637

  • Xu S, Yi N (2000) Mixed model analysis of quantitative trait loci. Proc Natl Acad Sci USA 97:14542–14547

  • Yang Y, Hoh J, Broger C et al (2003) Statistical methods for analyzing microarray feature data with replication. J Comput Biol 10:157–169

Author information

Corresponding author

Correspondence to Joris De Wolf.

Appendix

The variance-covariance matrix of the mixed model

The important features of the mixed model are given in the main text; the variance structure needs some further explanation. The matrix V is an n × n matrix (n is the total number of plants) with a block-within-a-block structure. Each larger block represents the variance-covariance between all plants of a line and is composed of four sub-blocks: the variance-covariance between the nulls of a line, the variance-covariance between the transgenic plants of a line, and the covariance between transgenics and nulls of a line, which appears twice (as a sub-block and as its transpose). The rest of the matrix V is filled with zeros. For instance, the block for the first line (V1) is given by:

$$ {\mathbf{V}}_1 = \left[ \begin{array}{ll} {\mathbf{V}}_{\rm a} & {\mathbf{V}}_{\rm ag}\\ {\mathbf{V}}_{\rm ag}^{T} & {\mathbf{V}}_{\rm g}\\ \end{array}\right] $$

where Va and Vg are square matrices with dimensions equal to the number of nulls (in Va) or transgenics (in Vg) of line 1, with the structure:

$$ {\mathbf{V}}_{\rm a}=\left[ \begin{array}{lllll} \sigma^2_a+\sigma^2_e & \sigma^2_a & \sigma^2_a& \cdots & \sigma^2_a\\ \sigma^2_a &\sigma^2_a+\sigma^2_e&\sigma^2_a&\cdots&\sigma^2_a\\ \sigma^2_a&\sigma^2_a&\sigma^2_a+\sigma^2_e&\cdots&\sigma^2_a\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ \sigma^2_a&\sigma^2_a&\sigma^2_a&\cdots&\sigma^2_a+\sigma^2_e\\ \end{array} \right] $$
$$ {\mathbf{V}}_{\rm g}=\left[ \begin{array}{lllll} \sigma^2_g+\sigma^2_e & \sigma^2_g & \sigma^2_g& \cdots & \sigma^2_g\\ \sigma^2_g &\sigma^2_g+\sigma^2_e&\sigma^2_g&\cdots&\sigma^2_g\\ \sigma^2_g&\sigma^2_g&\sigma^2_g+\sigma^2_e&\cdots&\sigma^2_g\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ \sigma^2_g&\sigma^2_g&\sigma^2_g&\cdots&\sigma^2_g+\sigma^2_e\\ \end{array} \right] $$

whereas Vag is rectangular, with all elements equal to \(\sigma_{ag}\):

$$ {\mathbf{V}}_{\rm ag}=\left[ \begin{array}{ccc} \sigma_{ag}&\cdots&\sigma_{ag}\\ \vdots&&\vdots\\ \sigma_{ag}&\cdots&\sigma_{ag}\\ \end{array}\right] $$
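The block structure described above can be made concrete with a small numerical sketch in base R. The variance components, the plant counts and the helper name `build_line_block` are hypothetical and chosen only for illustration; they are not values or code from the paper.

```r
# Hypothetical variance components (illustrative values, not from the paper)
sigma2_a <- 4.0   # between-line variance of the nulls
sigma2_g <- 6.0   # between-line variance of the transgenics
sigma_ag <- 3.0   # covariance between nulls and transgenics of a line
sigma2_e <- 2.0   # residual (plant-to-plant) variance

# Build the variance-covariance block of one line with n_null nulls
# followed by n_tr transgenic plants
build_line_block <- function(n_null, n_tr, s2a, s2g, sag, s2e) {
  # Square sub-blocks: line variance everywhere, residual variance added on the diagonal
  Va  <- matrix(s2a, n_null, n_null) + diag(s2e, n_null)
  Vg  <- matrix(s2g, n_tr, n_tr) + diag(s2e, n_tr)
  # Rectangular sub-block: every element equals the covariance
  Vag <- matrix(sag, n_null, n_tr)
  rbind(cbind(Va, Vag), cbind(t(Vag), Vg))
}

V1 <- build_line_block(n_null = 2, n_tr = 3, sigma2_a, sigma2_g, sigma_ag, sigma2_e)
print(V1)
```

The full matrix V would place one such block per line on the diagonal and zeros elsewhere.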

Estimation of the parameters

Estimation of variance components

The variance components \(\sigma^2_e\), \(\sigma^2_a\), \(\sigma^2_g\) and \(\sigma_{ag}\) cannot be estimated with the usual least squares procedure. Several methods have been proposed for this purpose, but Maximum Likelihood (ML) approaches are the most commonly used; the Restricted Maximum Likelihood (REML) approach (Patterson and Thompson, 1971) is specifically recommended and is the default method in most statistical software. ML methods can be used when the form of the probability distribution is known (or assumed). Through iterative algorithms, ML chooses as estimates for the parameters those values that are most consistent with the sample data. For mixed models, ML yields biased estimates of the variance components because it does not take into account that the fixed effects are themselves estimated. REML is similar to ML but accounts for the estimation of the fixed effects by working with appropriate linear combinations of the observations, which removes this source of bias.
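The difference between ML and REML is easiest to see in the degenerate case without random effects, where both estimators have a closed form: ML divides the residual sum of squares by n, REML by n minus the number of estimated fixed effects. The following base R sketch uses made-up observations and is only meant to illustrate that correction, not the iterative algorithms used for the full mixed model.

```r
# Made-up observations for 3 nulls and 3 transgenics (illustration only)
y     <- c(20.1, 21.3, 19.8, 22.4, 23.0, 21.9)
group <- factor(c("null", "null", "null", "tr", "tr", "tr"))

X   <- model.matrix(~ group)   # n x p design matrix of the fixed effects
n   <- nrow(X)
p   <- ncol(X)
fit <- lm.fit(X, y)            # ordinary least squares fit
rss <- sum(fit$residuals^2)    # residual sum of squares

sigma2_ml   <- rss / n         # ML: ignores that p fixed effects were estimated
sigma2_reml <- rss / (n - p)   # REML: accounts for the p estimated fixed effects

print(c(ML = sigma2_ml, REML = sigma2_reml))
```

The ML estimate is always the smaller of the two, and the gap grows when the number of fixed effects is large relative to the number of observations.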

Prediction of the random effects

The BLUP of the nulls of line i is given by the expected value of the random effect, conditional on the mean of the observations of the nulls of the i-th line (\(\overline{y}_{i0.}\)), but also on the mean of the transgenics of this line (\(\overline{y}_{i1.}\)):

$$ \hat{a}_i = E(a_i|\overline{y}_{i0.}, \overline{y}_{i1.}) $$
(1)
$$ \hat{g}_i = E(g_i|\overline{y}_{i0.}, \overline{y}_{i1.}) $$
(2)

To solve (1) we need the joint distribution of \(a_i\), \(\overline{y}_{i0.}\) and \(\overline{y}_{i1.}\), which would be fully known if \(\sigma^2_a\), \(\sigma^2_g\), \(\sigma_{ag}\) and \(\sigma^2_e\) were known. In the absence of the true values for these parameters, the REML estimates are used. Furthermore, \(\hat{a}_i\) and \(\hat{g}_i\) also depend on the numbers of observations, \(n_{i0}\) and \(n_{i1}\). The BLUP of real interest in these experiments is the effect of the individual insertion, \(\widehat{g_i - a_i}\), which is obtained by subtracting (1) from (2). The predictions of these effects differ from their estimates in the fixed model and can be either shrunk or expanded. Examples of these BLUPs and their shrinkage or expansion are given in the main text.
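As a worked sketch of (1), assume joint normality and write \(\mu_0\) and \(\mu_1\) for the fixed-effect means of nulls and transgenics (notation introduced here for illustration only, not taken from the main text). With \(n_{i0}\) nulls and \(n_{i1}\) transgenics in line i, the joint distribution would be

$$ \left[\begin{array}{c} a_i\\ \overline{y}_{i0.}\\ \overline{y}_{i1.} \end{array}\right] \sim N\left( \left[\begin{array}{c} 0\\ \mu_0\\ \mu_1 \end{array}\right], \left[\begin{array}{ccc} \sigma_a^2 & \sigma_a^2 & \sigma_{ag}\\ \sigma_a^2 & \sigma_a^2+\sigma_e^2/n_{i0} & \sigma_{ag}\\ \sigma_{ag} & \sigma_{ag} & \sigma_g^2+\sigma_e^2/n_{i1} \end{array}\right] \right) $$

and the standard result for conditional expectations of a multivariate normal then gives

$$ \hat{a}_i = \left[\begin{array}{cc} \sigma_a^2 & \sigma_{ag} \end{array}\right] \left[\begin{array}{cc} \sigma_a^2+\sigma_e^2/n_{i0} & \sigma_{ag}\\ \sigma_{ag} & \sigma_g^2+\sigma_e^2/n_{i1} \end{array}\right]^{-1} \left[\begin{array}{c} \overline{y}_{i0.}-\mu_0\\ \overline{y}_{i1.}-\mu_1 \end{array}\right] $$

The expression for \(\hat{g}_i\) follows by replacing the leading row vector with \([\sigma_{ag} \quad \sigma_g^2]\). When \(\sigma_{ag} = 0\) this reduces to the familiar shrinkage form \(\hat{a}_i = \sigma_a^2/(\sigma_a^2+\sigma_e^2/n_{i0})\,(\overline{y}_{i0.}-\mu_0)\).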

Estimation of fixed effects

The fixed effects in a mixed model are estimated by generalised least squares (GLS), much as in fixed-effects ANOVA. This method requires the variance components. Since these components are not known precisely but are estimated by REML, the GLS estimator of the fixed effects is biased. A small bias is acceptable since the estimator is more precise than the other candidate estimators (Greenland, 2000).

Simultaneous estimation of fixed and random effects

A method for getting BLUPs and fixed effects simultaneously is through the solution of Henderson’s mixed model equations, given by

$$ \left[\begin{array}{ll} X^{T}R^{-1}X& X^{T}R^{-1}Z\\ Z^{T}R^{-1}X& Z^{T}R^{-1}Z + G^{-1} \end{array} \right]\left[\begin{array}{l} \hat{\beta}\\ \hat{u}\end{array}\right]=\left[\begin{array}{l}X^{T}R^{-1}Y\\ Z^{T}R^{-1}Y\end{array}\right] $$
(3)

where Y, X and Z are the matrices from equation (5) of the main text, R is the dispersion matrix of the residuals, i.e. \(\sigma^2_e I_k\), and G is the dispersion matrix of the random effects, defined by matrix (9) of the main text. The advantage of this formulation is that the matrices to be inverted are far simpler than V, whose inverse is required in the generalised least squares solution.
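A minimal base R sketch of the mixed model equations is given below for a toy layout of two lines with two nulls and two transgenics each. The data and the variance components are made up, and the variance components are treated as known rather than estimated by REML. The sketch solves the equations and then cross-checks the result against the GLS route through the inverse of V.

```r
# Toy layout: 2 lines, each with 2 nulls and 2 transgenics (made-up numbers)
line   <- rep(1:2, each = 4)
transg <- rep(c(0, 0, 1, 1), times = 2)   # 0 = null, 1 = transgenic
y      <- c(20.1, 21.3, 22.4, 23.0, 18.9, 19.5, 19.2, 20.4)

# Assumed (not estimated) variance components
s2a <- 4; s2g <- 6; sag <- 3; s2e <- 2

X <- cbind(1, transg)                     # intercept + global transgene effect
# Z has one column per random effect, ordered a_1, g_1, a_2, g_2
Z <- matrix(0, length(y), 4)
for (k in seq_along(y)) Z[k, 2 * (line[k] - 1) + 1 + transg[k]] <- 1

G1 <- matrix(c(s2a, sag, sag, s2g), 2, 2) # dispersion of (a_i, g_i) within a line
G  <- kronecker(diag(2), G1)              # lines are independent
R  <- diag(s2e, length(y))
Ri <- solve(R)

# Henderson's mixed model equations: C %*% c(beta, u) = rhs
C   <- rbind(cbind(t(X) %*% Ri %*% X, t(X) %*% Ri %*% Z),
             cbind(t(Z) %*% Ri %*% X, t(Z) %*% Ri %*% Z + solve(G)))
rhs <- rbind(t(X) %*% Ri %*% y, t(Z) %*% Ri %*% y)
sol <- solve(C, rhs)
beta_hat <- as.vector(sol[1:2])           # fixed effects
u_hat    <- as.vector(sol[3:6])           # BLUPs a_1, g_1, a_2, g_2

# Cross-check: GLS estimate and BLUP computed through the inverse of V
V        <- Z %*% G %*% t(Z) + R
Vi       <- solve(V)
beta_gls <- solve(t(X) %*% Vi %*% X, t(X) %*% Vi %*% y)
u_blup   <- G %*% t(Z) %*% Vi %*% (y - X %*% beta_gls)
```

In this toy case V is only 8 × 8, so both routes are cheap; the benefit of the Henderson formulation appears when the number of plants is large, because R is diagonal and G is small.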

From equations (3) it can be derived that for the present model the BLUP for the nulls of the first line is:

$$ \begin{array}{ll} \hat{a}_1=&\left(\sum r_{null}\left(\frac{\sigma_a^2}{D}+\frac{n}{\sigma_e^2}\right)+ \sum r_{tr} \frac{\sigma_{ag}}{D}\right)\\ &\times\frac{\sigma_e^2 D^2}{(\sigma_a^2\sigma_e^2+nD)(\sigma_g^2\sigma_e^2+nD)-\sigma_{ag}^2\sigma_e^4} \end{array} $$
(4)

where \(\sum r_{null}\) is the sum of all deviations of the null plants in line 1 from the fixed effect, \(\sum r_{tr}\) the sum of all deviations of the transgenic plants in line 1 from the fixed effect, n the number of null plants in line 1 (taken here to equal the number of transgenic plants in that line), and

$$ D = \sigma_a^2\sigma_g^2-\sigma_{ag}^2 $$

The BLUP for the transgenics of the first line is similar. It is clear from equation (4) that the BLUP for the nulls involves not only the observations on the nulls, but also the observations on the transgenics, the variance components and the number of observations. The smaller the covariance \(\sigma_{ag}\), the less important are the observations on the transgenics. The BLUP also tends to be closer to the fixed estimate if the number of observations n is large and the residual variance \(\sigma^2_e\) is small. D tends to zero as the correlation \(\rho_{ag}\) tends to 1. High correlations will cause a strong shrinkage.
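The first statements above can be checked numerically with a short base R sketch that solves the per-line 2 × 2 block of the mixed model equations directly. The helper name `blup_line`, the deviations and the variance components are hypothetical; the sketch assumes n plants of each type in the line and treats the fixed effects and variance components as known.

```r
# BLUPs (a_hat, g_hat) of one line from the summed deviations of its nulls and
# transgenics, obtained from the per-line block of the mixed model equations:
# (Z'R^{-1}Z + G^{-1}) u = Z'R^{-1} r, with n plants of each type (assumption)
blup_line <- function(sum_r_null, sum_r_tr, n, s2a, s2g, sag, s2e) {
  G1  <- matrix(c(s2a, sag, sag, s2g), 2, 2)
  lhs <- diag(n / s2e, 2) + solve(G1)
  rhs <- c(sum_r_null, sum_r_tr) / s2e
  as.vector(solve(lhs, rhs))
}

# Hypothetical line: mean deviation of +1 for the nulls, +2 for the transgenics
with_cov <- blup_line(5, 10, n = 5, s2a = 4, s2g = 6, sag = 3, s2e = 2)
no_cov   <- blup_line(5, 10, n = 5, s2a = 4, s2g = 6, sag = 0, s2e = 2)
# Without covariance, changing the transgenic deviations leaves a_hat unchanged
no_cov_2 <- blup_line(5, -10, n = 5, s2a = 4, s2g = 6, sag = 0, s2e = 2)
# With many plants per line, a_hat approaches the mean deviation of the nulls
many     <- blup_line(500, 1000, n = 500, s2a = 4, s2g = 6, sag = 3, s2e = 2)

print(rbind(with_cov, no_cov, no_cov_2, many))
```

The first column of the printed matrix holds the BLUP for the nulls and the second the BLUP for the transgenics.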

Estimation of linear combinations

The mixed model equations (3) can also be used to estimate linear combinations of fixed effects and/or BLUPs. For instance, the BLUP of a particular linear combination is defined as:

$$ \widehat{L^{T}\beta+M^{T}u}=L^{T}\hat{\beta}+M^{T}\hat{u} $$
(5)

The variance of this linear combination is given by:

$$ \mathrm{Var}(L^T\hat{\beta}+M^{T}\hat{u}) =[L^T\quad M^T]\, C^{-1}\left[\begin{array}{l} L\\M\end{array}\right] $$
(6)

where C is short for the leftmost matrix of equations (3). All predictable functions can be constructed this way. A detailed description of prediction with mixed models is given by Welham et al. (2004). An important concept in this context is the prediction space, i.e. the population to which the inference applies. We elaborate on this in the next section.

Inference for mixed models

After fitting a model, we normally want to assess the significance and the precision of the estimates. In these experiments only two tests are of interest. First, there is the test of whether the fixed effect, i.e. the global effect of the transgene, is significant. Second, we are interested in comparing the BLUPs for nulls and transgenics within a line.

Testing a fixed effect

The global difference between nulls and transgenics over all lines is a fixed effect. Fixed effects or linear combinations of fixed effects in mixed models could theoretically be tested by a Wald test. The Wald statistic is given by

$$ A=\frac{\hat\beta_i^2}{\widehat{\mathrm{Var}}(\hat\beta_i)} $$
(7)

Under the null hypothesis \(H_0: \beta_i = 0\) against the alternative \(H_1: \beta_i \neq 0\), A asymptotically has a \(\chi^2\)-distribution with 1 degree of freedom. However, since the variance is estimated from small samples, the Wald test is inaccurate and has high Type I error rates. Solutions include (Bartlett-adjusted) likelihood ratio tests (Welham and Thompson, 1997), but in most cases the Wald statistic is still used and evaluated against an \(F_{\nu_1,\nu_2}\)-distribution. In this empirical method the numerator degrees of freedom \(\nu_1\) equal the number of contrasts tested. For the denominator degrees of freedom \(\nu_2\) several possible solutions exist. Recommended are the Satterthwaite method and the naive method (Manor and Zucker, 2004). For the experiment described above both methods reduce to twice the number of transgenic lines tested (2q).
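The effect of the reference distribution can be illustrated with a few lines of base R. The estimate, its standard error and the number of lines are hypothetical; the sketch simply evaluates the Wald statistic (7) against the asymptotic chi-square distribution and against an F-distribution with 1 and 2q degrees of freedom.

```r
# Hypothetical fixed-effect estimate and standard error (made-up numbers)
beta_hat <- 1.2   # estimated global transgene effect
se_beta  <- 0.5   # its estimated standard error
q        <- 6     # number of transgenic lines tested

A <- beta_hat^2 / se_beta^2   # Wald statistic, equation (7)

# Asymptotic reference: chi-square with 1 degree of freedom
p_chisq <- pchisq(A, df = 1, lower.tail = FALSE)
# Small-sample reference: F with 1 numerator and 2q denominator degrees of freedom
p_F     <- pf(A, df1 = 1, df2 = 2 * q, lower.tail = FALSE)

print(c(A = A, p_chisq = p_chisq, p_F = p_F))
```

For the same statistic the F reference gives the larger p-value, which is how it counteracts the inflated Type I error rate of the chi-square approximation in small experiments.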

Testing a linear combination of fixed and random effects

Linear combinations can be tested through a t-test or by constructing confidence intervals based on the standard errors obtained from the mixed model equations through (6). When calculating standard errors in the context of mixed models, the inference space has to be defined. The prediction space is narrow when the prediction applies only to the insertions tested in the experiment; predictions for the narrow space therefore resemble estimates in a fixed effects model. On the other hand, the prediction space is broad when the prediction applies to the entire population of insertions. If some random effects are kept fixed and others are random, the prediction space is intermediate. The prediction space is set by the matrix \(\mathbf{M}\) in equations (5) and (6). A prediction in the broad space has all elements of \(\mathbf{M}\) equal to zero. In this way, the random effect does not contribute to (5) but it does add a variance component to (6). In the narrow space, all or some elements of \(\mathbf{M}\) are non-zero. The point prediction is influenced by the non-zero elements but they add no variance. The contrast between the transgenics and the control plants of a particular event involves both fixed and random effects, but for the random effect a particular level is chosen. In other words, these contrasts necessarily apply to the narrow prediction space.
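How the choice of M enters equations (5) and (6) can be sketched in base R on a toy layout of two lines with two nulls and two transgenics each. The data, the variance components (treated as known) and the helper name `predict_combo` are hypothetical. The broad-space call predicts the global transgene effect; the narrow-space call predicts the transgene effect within line 1 by adding the contrast between its two random effects.

```r
# Toy layout: 2 lines, each with 2 nulls and 2 transgenics (made-up numbers)
line   <- rep(1:2, each = 4)
transg <- rep(c(0, 0, 1, 1), times = 2)   # 0 = null, 1 = transgenic
y      <- c(20.1, 21.3, 22.4, 23.0, 18.9, 19.5, 19.2, 20.4)
s2a <- 4; s2g <- 6; sag <- 3; s2e <- 2    # assumed variance components

X <- cbind(1, transg)                     # intercept + global transgene effect
Z <- matrix(0, length(y), 4)              # columns: a_1, g_1, a_2, g_2
for (k in seq_along(y)) Z[k, 2 * (line[k] - 1) + 1 + transg[k]] <- 1
G <- kronecker(diag(2), matrix(c(s2a, sag, sag, s2g), 2, 2))

# Coefficient matrix C of the mixed model equations, with R = s2e * I
W <- cbind(X, Z)
C <- crossprod(W) / s2e
C[3:6, 3:6] <- C[3:6, 3:6] + solve(G)
Cinv <- solve(C)
sol  <- as.vector(Cinv %*% crossprod(W, y)) / s2e   # c(beta_hat, u_hat)

# Point prediction (5) and variance (6) of a linear combination
predict_combo <- function(L, M) {
  k <- c(L, M)
  c(prediction = sum(k * sol),
    variance   = as.numeric(t(k) %*% Cinv %*% k))
}

broad  <- predict_combo(L = c(0, 1), M = c(0, 0, 0, 0))    # all elements of M zero
narrow <- predict_combo(L = c(0, 1), M = c(-1, 1, 0, 0))   # adds g_1 - a_1 of line 1

print(rbind(broad, narrow))
```

A t-statistic or confidence interval for either prediction would then use the square root of the corresponding variance as standard error.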

Computer programme

All analyses have been done using R version 2.4.0 and the library nlme. The main analysis statement is

    mod <- lme(data = DS,
               fixed = TKW ~ Transgenity,
               random = list(Line = pdLogChol(~ Transgenity - 1)))

With the more recent R library lme4 the same model is obtained by

    mod <- lmer(TKW ~ Transgenity + (Transgenity - 1 | Line),
                data = DS)

and most of the presented results can be extracted from the mod object through summary(mod), fixef(mod), ranef(mod) and getVarCov(mod). For the calculation of standard errors and p-values on differences between BLUPs, a function had to be written based on the Henderson formulation. This function is available from the authors.

The same model can be obtained with SAS proc mixed with the programme:

    /* copy of Transgenity for use in the random statement */
    data DS;
      set DS;
      TransgenityR = Transgenity;
    run;

    proc mixed data=DS method=reml;
      class Transgenity Line;
      model TKW = Transgenity / solution ddfm=satterth outp=outpredict;
      random TransgenityR / subject=Line type=UN solution g;
    run;

Cite this article

Wolf, J.D., Duchateau, L. & Schrevens, E. Dealing with sources of variability in the data-analysis of phenotyping experiments with transgenic rice. Euphytica 160, 325–337 (2008). https://doi.org/10.1007/s10681-007-9526-z

  • DOI: https://doi.org/10.1007/s10681-007-9526-z
