Introduction

The genetic consequence of selection is the change in frequencies of the genes affecting fitness. The process of evolution is reflected by the dynamic change of gene frequencies by selection and other evolutionary agents. Fitness is a complicated trait, which can be decomposed into many fitness components (Falconer and Mackay, 1996; Hartl and Clark, 1997). Therefore, the genetic variance of fitness is considered to be controlled by the segregation of multiple genes. Fitness behaves like a quantitative trait. It responds to natural selection with a response equal to the genetic variance of fitness (Fisher, 1958). To study the genetic architecture of fitness, it is important to explore the change of gene frequency of alleles at individual loci. However, only in very limited situations, for example, where allozyme markers are available, can we evaluate natural selection on individual loci. In most situations, we do not know what the genes are and where in the genome the genes are located. With the rapid development of molecular technology, large amounts of molecular data are now available, which provide a great opportunity to estimate the effects and locate the chromosomal positions of loci responsible for complicated traits, for example, quantitative traits. The technology is now called quantitative trait locus (QTL) mapping. Since fitness is just another complicated trait with a polygenic background, a similar technology can be applied to map loci determining variation in fitness.

Although it does not seem easy to map fitness loci, statistical methods of mapping QTL can be adopted (Lander and Botstein, 1989). Fu and Ritland (1994a),(1994b) first utilized a QTL mapping approach to map viability (a fitness component) loci under the maximum likelihood (ML) framework. Mitchell-Olds (1995) also proposed a similar ML method for viability mapping in F2 families. Recently, Vogl and Xu (2000) investigated a Bayesian method to map viability loci in a backcross family. All the aforementioned existing methods deal with line-crossing experiments that require inbred lines. Inbred lines, however, may not be available for many species, such as humans, large animals and trees (Hedrick and Muona, 1990). Mapping viability loci may be more relevant to natural populations than to line crosses. This is equivalent to the situation where mapping QTLs is more relevant to breeding populations than to designed line crosses. However, it is easier to map QTLs in line-crossing experiments because we can control the genetic background and environments. After QTL are mapped in line crosses, the results may be extended to natural populations or used to find homologous loci in closely related species. Similarly, viability loci may be mapped in line crosses and the inference later extended to natural populations. In this study, we attempt to map viability loci directly in outbred populations. Full-sib families are the simplest outbred populations. Although not necessarily natural populations, they are one step closer to natural populations than are line crosses.

The fitness of a genotype at a locus is the average fitness of all individuals bearing this genotype. If we assign the fitness for the ‘best’ genotype a value of one, the selection coefficient for an arbitrary genotype is defined as the reduction in fitness from this maximum value. Therefore, we only describe the measurement of fitness (rather than the selection coefficient) in subsequent discussion. Viability is only one of many components of fitness. Fecundity is another important component. In this study, however, we focus only on loci responsible for viability selection, assuming that all surviving individuals have an equal fecundity.

We develop a model of viability mapping that uses a full-sib family derived from the mating of two unrelated outbred parents. A full-sib family contains four different alleles at a single locus, rather than two as is usually assumed in inbred line crosses. Mapping in a full-sib family requires the general rule of allelic transmission from parents to children and thus the algorithm can be extended to pedigree analysis. The method can be directly applied to fitness analysis for open-pollinated plants.

Theory and methods

Genetic model of fitness

Consider a single viability locus and a full-sib family. Denote the genotypes of the sire (paternal parent) and dam (maternal parent) by As1As2 and Ad1Ad2, respectively. Mating between the two parents will generate progenies each with one of the four possible genotypes: {As1Ad1,As1Ad2,As2Ad1,As2Ad2}. Under the assumption of Mendelian segregation, the four genotypes will have an equal frequency, that is, ¼. If this locus is subject to viability selection, we will observe two or more genotypes, which have frequencies different from Mendelian expectations.

To model viability selection, we define the underlying frequencies of the four genotypes in the progeny by a vector w=[w11 w12 w21 w22] for 0≤wkl≤1, ∑klwkl=1 and k,l=1,2. These frequencies are now defined as the relative fitness of the four genotypes. This is a little different from the usual definition of relative fitness in which the maximum fitness is set to one and the rest expressed as reduced values relative to one. Deviation of w from the Mendelian vector w0=[¼ ¼ ¼ ¼] reflects the intensity of viability selection.

The fitness of a genotype can be decomposed into the product of the fitness of the two alleles that make up the genotype and a deviation reflecting the interaction between the two alleles, called the dominance effect, that is

where wsk and wdl denote the relative fitness of the kth allele of the sire and the lth allele of the dam, respectively, and δkl is the dominance effect. This partitioning of the fitness is important because we can separate gametic selection from zygotic selection using statistical technology. Note that there are four possible genotypes in the progeny, but after the decomposition we have eight parameters. Therefore, we must impose some restriction to the parameters to make the model estimable. We take the restrictions similar to those used in the four-way cross model (Xu, 1998) and define three new independent parameters:

It is interesting to know that the fitness values of the four genotypes can be expressed as functions of the three independent parameters, as shown below:

This model is important in hypothesis tests and computer simulations that will be discussed in later sections.

ML estimation

We first assume that the four alleles of the viability locus in the parents are distinguishable and the genotypes are observable. Suppose that we sample n individuals from the full-sib family in question. Let us define

where yj(kl)=1 and y j ( k ' l ' ) = 0 for k′≠k and l′≠l if individual j takes genotype AskAdl. We now have the data, y, and the parameter, w, which allow the construction of the log likelihood:

The ML estimate of w is simply

for k,l=1,2.

In fact, the genotype of a viability locus cannot be observed and we must use markers to infer the genotype. Unless the viability locus is located exactly at a fully informative marker, inference will be subject to error. The amount of error depends on the distances of the viability locus from marker loci, the level of marker polymorphism and the genotypes of the markers. As a result of the error, we are not certain about the actual genotype of the viability locus for each individual, even though we can observe the marker genotypes. The viability locus can take any one of the four genotypes, but with a different probability for each genotype given the marker information. Define the four conditional probabilities of the given viability locus markers by pj= [pj(11) pj(12) pj(21) pj(22)] for 0≤pj(kl)≤1 and ∑2k=12l=1pj(kl)=1. This is a typical problem of missing values in statistics where we can use the expectation-maximization (EM) algorithm to solve for the MLE. The actual incomplete-data log likelihood is

where the missing data y have been ‘integrated out’. There are several ways to solve the MLE, but we take the EM algorithm (Dempster et al, 1977).

First, we choose an initial value w(0) and calculate the expectation of yj(kl) conditional on w=w(0),

which is also called the posterior probability of yj(kl). We have now completed the expectation step (E-step). The maximization step (M-step) is simply to replace yj(kl) in equation (7) by the conditional expectation,

This concludes the first iteration of the EM algorithm. The iteration continues until convergence at the tth iteration and the MLE takes ŵ=wt. According to the invariance property of MLE, we have

The EM algorithm provides a convenient way to solve the MLE, but it does not automatically give the asymptotic variance–covariance matrix of ŵ, which must be obtained separately through some additional computation (Louis, 1982). This is the drawback of the EM algorithm compared to Fisher's scoring method, which automatically provides an asymptotic variance–covariance matrix for the MLE. However, Fisher's scoring method requires calculation of the information matrix, which is not easy in the missing value problem. In practice, we can use the bootstrap method (Efron, 1979) to assess the variance–covariance matrix. The bootstrap method is computationally demanding, but the method is executed only once after convergence has been reached and only on the positions that show significant evidence of viability selection.

Hypothesis test

Recall that the conditional probability of the viability locus genotype is calculated from marker information with the assumption that the location of the viability locus relative to the markers is known. Therefore, the hypothesis test on the effects of the viability locus is actually a conditional test given the position of the viability locus. If the test is not significant, we will conclude that the current position of the chromosome being tested does not segregate for a viability locus. To test the overall hypothesis of no viability selection, we need to scan the entire genome (multiple tests). The null hypothesis (no viability selection) will be rejected if none of the locus-specific tests is significant. We will discuss the overall test later and now focus on the test of an individual locus.

The first null hypothesis is H0: w=ŵ0, which tests no segregation distortion for the locus of interest. The test statistic is λ=−2[L(w0)−L(ŵ)], where L(w0)=nln(¼)= −1.3863n. Under the null hypothesis, λ will approximately follow a χ2 distribution with three degrees of freedom.

If this null hypothesis is rejected, we can further test the significance of each component. The null hypothesis that the two alleles carried by the sire have identical fitness is formulated by Hs: ws=0, wd≠0, δ≠0. The test statistic for Hs is λs=−2[L(ŵs)−L(ŵ)] where L(ŵs) is the log likelihood value obtained by maximizing L(w) under the restriction of ws=(w11+w12)−w21+w22)=0, which is achieved by using the Lagrange multiplier. A more intuitive and easier way to enforce the restriction is to make the substitutions, w12=½−w11) and w22=½−w21), which reduces the number of parameters to two, w11 and w21. The EM solutions of these two parameters are

and

because the denominators equal due to the restrictions. The MLE of the remaining parameters are ŵ12=½−ŵ11) and ŵ22=½−ŵ21. Under Hs, λs will approximately follow a χ2 distribution with one degree of freedom.

The null hypothesis that the two alleles carried by the dam have identical fitness is formulated by Hd: wd=0, ws≠0, δ≠0, where the test statistic for Hd is λd=−2[L(ŵd)−L(ŵ)], with L(ŵd) being the log likelihood value obtained by maximizing L(w) under the restriction of wd=(w11+w21)−w12+w22)=0. The EM solutions of the parameters are

and

The MLE of the remaining parameters are ŵ21=½−ŵ11 and ŵ22=½−ŵ12.

Again, under Hd, λd will approximately follow a χ2 distribution with one degree of freedom.

The null hypothesis that the dominance effect is absent is formulated as Hd: δ=0, ws≠0, wd≠0. Let us define

The MLE under this restriction are ŵ11=ab, ŵ12=a(1−b), ŵ21=(1−a)b and ŵ22=(1−a)(1−b). Again, the test statistic for Hδ is λδ=−2[L(ŵδ)−L(ŵ)], which follows approximately a χ2 distribution with one degree of freedom.

Genome scanning

To scan viability loci for the entire genome, we need to move the putative position from one end to the other end of the genome. The genotype of each chromosome position for each individual is inferred from marker information, that is, pj(kl)=Pr(yj(kl)=1IM), where IM stands for marker information. For outbred populations, not all markers are fully informative. Therefore, we adopted the multipoint method developed by Rao and Xu (1998) to infer the probabilities of viability loci. This multipoint method is identical to that of Kruglyak and Lander (1995) when the linkage phases of the parents are known. In our study, we focus on developing the genetic model of viability mapping rather than the multipoint method. Therefore, we assume that the parental marker linkage phases are known without error. This assumption holds very well when the family size is sufficiently large because the true linkage phases can be easily recovered using marker information of the progeny.

To find the optimal location of the viability locus on the chromosome, we test all putative positions. However, the chromosome is a continuous linear structure, and there are an infinite number of putative positions. As usually done in interval mapping (Lander and Botstein, 1989), we scan the whole chromosome from one end to the other by evaluating a position in every one or two cM. The likelihood ratio test statistic is then plotted against the chromosomal position to form a test statistic profile. The MLE of the position of viability locus takes the one where the peak occurs. The critical value used for declaring at least one viability locus on the entire genome with a type I error rate of α = 0.05 is found using the permutation test (Churchill and Doerge, 1994).

Monte Carlo simulation

We simulated one chromosome of length 100 cM with 11 markers evenly spaced. The two alleles of each parent at each locus were randomly assigned from five distinguishable alleles (randomly selecting two out of five). This generates markers with a range of information content. A single viability locus was simulated at position 25 cM, that is, between markers 3 and 4. The following factors were considered in the simulations: the mode of viability selection, the intensity of viability selection and sample size of the mapping population. The purpose of the simulation was not to compare the relative efficiencies of different methods for viability mapping (since there are no other methods to compare), nor to investigate the range of parameter values where the method works best. Instead, we simply attempted to demonstrate that the method works well and the test statistic behaves as expected. From this simulation study, we try to validate our method and program of viability mapping.

The mode of viability selection was investigated under three levels: an additive model, a dominance model and a combination of both additive and dominance. For the additive model, we set δ = 0 and ws = wd = 0.1,0.2,0.3. From these parameters, the fitness values of the four genotypes were generated. Under the dominance model, we set ws = wd=0 and δ = −0.15,0.05,0.15. We also investigated one model with both the additive and dominance effects, that is ws = wd=0.15 and δ = 0.1.

From the three effects of the viability locus, we use equation (3) to calculate the actual fitness values of the four possible genotypes. For example, when ws = wd=0.15 and δ = 0.1, the four fitness values are

Following the conventional notation of natural selection, we calculate the selection coefficient for the fitness of genotype AskAdl using skl = 1 − wkl/wmax (Hartl and Clark, 1997). These selection coefficients were used to determine whether an individual with genotype AskAdl should be deleted from the mapping population (Table 1).

Table 1 Parameter values used in the simulation experiments

Three different sample sizes under each of the above models were investigated, n=50,:100,:200. The estimated location of the putative viability locus under each analysis took the position where the peak of the test statistic profile occurred. The simulation was replicated 100 times under each setting. The means and standard deviations of the 100 replicates were used to evaluate the performance of each parameter combination.

The empirical statistical power for each setting was calculated as the percentage of the replicates (out of 100 simulations) with the highest (overall) test statistic (along the chromosome) greater than the empirical critical value. The expected standard error for the empirical power is where Nr is the number of replicates. For example, if the true power is 1−β=0.8, the standard error is 0.04, which is reasonably small. The critical value was obtained by simulating additional 1000 samples under the null hypothesis. The highest test statistics of the 1000 samples were ranked from the lowest to the highest. The empirical critical value took the 95th percentile of the distribution of the null samples.

The test statistic profile of a single replicate for the combined additive and dominance model (ws=wd=0.15 and δ = 0.1) with sample size n=100 is demonstrated in Figure 1 (the dotted line). From the total test statistic profile, we can see that the viability locus has been identified at position 23 cM, very close to the true position (25 cM). The estimated effects for this particular run are ŵ11=0.5902), ŵ12=0.03290, ŵ21=0.03410 and ŵ22=0.3528. These estimated fitness values were converted into =0.2071 using equation (2). The estimated effects are larger than the simulated effects, but maintain the same trend. The deviations are not larger than expected considering the sampling errors with n= 100. The average test statistic profile of the 100 replicates for this setting is shown in Figure 1 (solid line), clearly showing the expected property of the test statistic profile for QTL mapping.

Figure 1
figure 1

Likelihood ratio test statistic profiles for the combined A/D model ŵs=ŵd=0.15 and δ=0.1 with sample size n =100. The simulated position of the viability locus is located at position 25 cM (indicated by the solid bar). The solid line is the average profile of 100 replicates, the dotted line is the profile of a randomly picked single run from the 100 replicates and the dashed horizontal line is the threshold value for the test statistic at α = 0.05.

The empirical critical values appear to be quite independent of the sample size and they are about 14.5 at α = 0.05 and about 18.0 at α = 0.01. These empirical critical values are clearly larger than χ23,0.95 = 7.815 and χ23,0.99 = 11.34. Therefore, we used the empirical critical values to declare significance.

Means and standard deviations of the estimated parameters for various genetic models are given in Table 2 for n=50, Table 3 for n =100 and Table 4 for n = 200. The results do follow the expected trends: the viability locus location is more accurately estimated as the sample size and the selection intensity increase. When the sample size is small, the estimated position of the locus is severely biased towards the center of the chromosome. Besides these general trends, we found that the additive models are more sensitive to the intensity of selection. Under different levels of parameters (high, medium and low), the accuracy of both the estimated viability locus location and parameters varies more than the dominance models and both additive and dominance (A/D) model. Overall, the A/D model gives the highest accuracy of estimation.

Table 2 Means and standard deviations (in parentheses) of estimated parameter values for the EM algorithm with sample size 50
Table 3 Means and standard deviations (in parentheses) of estimated parameter values for the EM algorithm with sample size 100
Table 4 Means and standard deviations (in parentheses) of estimated parameter values for the EM algorithm with sample size 200

The empirical statistical powers under various genetic models and sample sizes are given in Table 5. The powers are quite low for small sample size (n=50) and are reasonably high when sample size reaches 200. These observations are the same as those expected in the more usual QTL mapping studies. The results of these simulations have verified the derivations of our methods and the computer programs; more importantly, they have demonstrated that viability locus mapping can be accomplished following the usual approach of QTL mapping.

Table 5 Empirical statistical powers (%) under type I error rates of 0.05 and 0.01

Discussion

The fitness considered here is a special fitness component, the viability, which relates to the change of gene frequencies in the current generation where the mapping individuals are collected. Another major fitness component is the fecundity, that is, the number of progenies produced by the individual of interest. Fecundity is also related to the change of gene frequencies, but it affects the gene frequencies in the next generation. Fecundity is measured quantitatively and thus mapping fecundity loci can be directly accomplished using standard QTL mapping approaches. Therefore, we only focused on the statistics of mapping viability loci in this study. The ultimate result of viability selection in a population is the change in gene frequencies, but if we concentrate on one particular family or pedigree, the result of viability selection is the deviation of allelic segregation from the expected Mendelian ratio. The non-Mendelian segregation of a viability locus causes deviation from Mendelian segregation for markers linked to the viability locus. The viability considered in this study is defined in the adult stage (genotype). However, the statistics developed allow us to separate the gametic selection from zygotic selection. The maternal and paternal allelic effects represent the gametic selection and the dominance effect represents the zygotic selection.

The purpose of the simulation studies is to demonstrate that viability mapping can be performed in the same way as QTL mapping. There was no attempt to explore the range of parameter values in which the method works better than for other ranges. That would require extensive simulation studies with exhaustive combinations of parameter values. However, from the results of the limited simulation experiments, we conclude that the sample size should be sufficiently large to be able to detect a locus subject to a weak selection. For the parameter values selected in our simulation experiments, n≥200 seems to be reasonable.

In the evolutionary literature, the fitness of the best genotype (the maximum fitness) is usually set to unity and the fitness values of all other genotypes are then expressed as lower values than unity (Hartl and Clark, 1997). As a result of the restriction, wmax = 1, the fitness defined in this way is called the relative fitness. The fitness values defined in this study are also relative fitness but with a different restriction, ∑klwkl = 1. The difference in the restriction has no effect on the estimation and statistical tests. This has been verified by our simulation studies where we converted the fitness values into selection coefficients by setting wmax = 1 and expressed the selection coefficients as skl = 1 − wkl/wmax. The estimated fitness values are very close to the true values simulated. In fact, researchers often convert the relative fitness into selection coefficients as we did in the simulations and investigate the magnitudes of the selection coefficients. In natural populations, people often concentrate on the biallelic situation with only three phenotypes: A1A1, A1A2 and A2A2. Using the selection coefficients, researchers are able to investigate the degree of dominance. If the A1A1 is the fittest genotype, the fitness values of the three genotypes are defined as w11 = 1, w12 = 1−hs and w22 = 1−s, respectively, where s represents the ‘additive effect’ or gametic selection and h represents the ‘degree of dominance’ or zygotic selection. We simply used a different but more general notation to handle multiple alleles.

Mapping viability loci has only been investigated in line-crossing experiments (Fu and Ritland, 1994a; Mitchell-Olds, 1995; Vogl and Xu, 2000). Results are only rarely inferred to natural populations, which are usually outbred. A full-sib family is the simplest case that can be studied from an outbred population. This research is the first attempt to extend viability mapping to outbred populations. The results can be easily extended to more complicated outbred pedigrees, commonly seen in humans, trees and large animals. In pedigree analysis, we focus on the relative representation of founder alleles. Each founder carries two alternative alleles at any locus. Under Mendelian segregation (no viability selection), the two alleles should be equally represented in the descendents. However, if there is evidence that the two alleles are not equally represented, the locus may be subject to viability selection. The allele comparisons from different founders can be combined to increase the power of viability locus detection. The multiple allelic model in pedigree analysis may be investigated as follows. Assume that there are F/2 founders with a total of F founder alleles. The model parameters may be set up in an F × F table as shown in Table 6. The fitness of genotype AkAl is wkl for k,l = 1,⃛,F, which is partitioned as wkl = w.k w.l+δkl, where wk.=w.k is the proportion of allele Ak represented in the mapping population and δkl=δlk is the dominance effect. Notice the symmetry of the definitions. The parameters of the viability locus are wk. and δkl for k,l = 1, ⃛, F. Restrictions are required to make the model estimable and they are ∑Fk=1w.k = 1 and ∑Fk.=1δkl = 0 for all l. To test the hypothesis that there is no gametic selection, we test wk.μk=0 for all k, where μk is the theoretical proportion of the presence of allele Ak in the mapping population and can be calculated based on pedigree information. For example, in a diallele mating design, μk = 1/F for all k. To test the hypothesis of no zygotic selection, we test δkl = 0 for all k and l. In fact, it is convenient to formulate viability mapping in pedigrees as a random model problem where we are interested in testing

Table 6 The definitions of fitness parameters for an outbred population with F founder alleles

Of course, the founder alleles cannot be traced without marker information. Inference of the relative contributions of the founder alleles in the descendents is not easy. We need to invoke the recurrent algorithm of Yi and Xu (2001) to trace the allelic origin of each allele in the mapping population. If missing markers are involved, the descent graph algorithm of Sobel and Lange (1996) is also needed. In addition, we will need to adopt the Bayesian method implemented via the Markov chain Monte Carlo (Gelman et al, 1995; Green, 1995; Satagopan and Yandell, 1996; Heath, 1997; Richardson and Green, 1997; Sillanpaa and Arjas, 1998; Stephens and Fisch, 1998; Vogl and Xu, 2000). The detailed algorithm has not been worked out and development of such an algorithm is our next project.

Our assumption is that segregation distortion is caused by viability selection. However, it may also be caused by genotyping errors. If genotyping errors happen randomly across loci and genotypes within loci, they may not bias our results but increase the errors of our detection and estimation. This can be compensated by increasing the sample size. However, if genotyping errors happen in a systematic manner, that is, some genotype is more often scored as another genotype, then the result will be confounded with segregation distortion. If we know this a priori, we may put a flag on the genotype and treat the genotype as incomplete. For example, if A1A2 is often scored as A1A1 for some markers, the experimenter should warn us that ‘A1A1’ may have a certain probability of being A1A2. This probability may be incorporated into our analysis. This incomplete information is still useful because we are sure that ‘A1A1’ is not A2A2.

We have not investigated multiple viability loci in this study, simply because the single locus model can also be applied to the search for multiple loci. This is equivalent to the interval mapping of Lander and Botstein (1989), which is a single QTL mapping procedure but has been used to search for multiple QTLs by most people. When two or more loci are located in a single chromosome but with a large distance between pairs of loci, multiple loci can be detected from multiple separated peaks of the test statistic profiles. Similar to multiple QTL mapping, when two viability loci are close together or there is an interaction (non-multiplicative) effect between the two loci, the model needs to be revised to take into consideration multiple viability loci, as the multiple QTL model of Kao et al (1999). The situation of multiple locus viability mapping in pedigrees is extremely complicated. A Bayesian approach should be considered (Sheehan and Thomas, 1993; Lin et al, 1993,1994; Lin, 1995,1996; Hoeschele et al, 1997). The ML analysis for a single locus proposed in this study serves as a necessary first step towards a full solution of viability mapping.