Technical Comments
[…] $\chi^2_1$ distribution (the $\chi^2$ distribution on one degree of freedom). When allele A is associated with disease, $\gamma > 1$, and the statistic $X^2$ follows the noncentral $\chi^2_{1,\lambda}$ distribution, with noncentrality parameter $\lambda$ […]. $\gamma$, […], and p are taken as fixed values, and n is a variable in $\lambda$. As in Risch and Merikangas, the significance level for a given marker locus is set at $\alpha = 5 \times 10^{-8}$, to give a genome-wide significance level of 5%. The power of the test for a single marker is then the probability that a $\chi^2_{1,\lambda}$ variable exceeds Q = 29.72, the quantile of the $\chi^2_1$ distribution corresponding to the per locus significance level. As the $\chi^2_{1,\lambda}$ variable is equal in distribution to $Y^2$, where Y is a normal $(\sqrt{\lambda}, 1)$ variable, it follows that the power equals $1 - \mathrm{pr}(-\sqrt{Q} < Y < \sqrt{Q})$. Converting to standard deviates and taking the area added in the lower tail as negligible, the power is approximately $1 - \Phi(\sqrt{Q} - \sqrt{\lambda})$, where $\Phi$ is the cumulative distribution function for a standard normal variable. Setting the power at 0.8 determines the value that $\sqrt{\lambda}$ must equal, and then one can solve for n. The $\chi^2$ distributions have been used mainly to facilitate sample size calculations; given marker-disease data, one would probably use Fisher's exact test to detect associations.
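The power approximation in this note can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the name `lam` for the noncentrality parameter are assumptions made for illustration, not the authors' notation. It recovers the critical value Q from the per-locus significance level, compares the exact two-tailed power with the one-tail approximation, and solves for the noncentrality parameter that gives 80% power. Turning that value into a sample size n would require the authors' expression for the noncentrality parameter, which is not attempted here.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def critical_value(alpha):
    """Upper-alpha quantile of the 1-df chi-square, via the squared normal quantile."""
    z = -STD_NORMAL.inv_cdf(alpha / 2.0)
    return z * z


def power_exact(lam, q):
    """pr(Y^2 > q) for Y ~ Normal(sqrt(lam), 1), keeping both tails."""
    root_q, root_lam = q ** 0.5, lam ** 0.5
    inside = STD_NORMAL.cdf(root_q - root_lam) - STD_NORMAL.cdf(-root_q - root_lam)
    return 1.0 - inside


def power_approx(lam, q):
    """Same probability with the lower-tail contribution dropped."""
    return 1.0 - STD_NORMAL.cdf(q ** 0.5 - lam ** 0.5)


def required_noncentrality(power, q):
    """Invert the approximation: the lam at which power_approx equals `power`."""
    root_lam = q ** 0.5 - STD_NORMAL.inv_cdf(1.0 - power)
    return root_lam ** 2


if __name__ == "__main__":
    q = critical_value(5e-8)  # about 29.72, as quoted in the text
    lam = required_noncentrality(0.8, q)
    print(f"Q = {q:.2f}")
    print(f"noncentrality needed for 80% power = {lam:.2f}")
    print(f"exact power = {power_exact(lam, q):.6f}")
    print(f"approximate power = {power_approx(lam, q):.6f}")
```

At this significance level the dropped lower-tail term is far too small to affect the result, which is why the approximation is safe here.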
Risch and Merikangas (1) show the great power of genetic association studies such as the TDT in the detection of genes with modest effects. As they mention, all TDT computations were based on the optimal assumption that the analyzed allele was the disease allele itself. A more common situation is, and could well remain, the analysis of polymorphisms that have a low prior probability of being the disease allele even if they are within the actual disease gene. The power of the TDT is highly dependent not only on the linkage disequilibrium between the disease allele and the analyzed allele but also on the relative frequencies of both these alleles.
With the same genetic model as that used by Risch and Merikangas--a disease locus with two alleles, A and a, with population frequencies of $p$ and $1 - p$, respectively, and a multiplicative model with genotypic relative risks of $\gamma$ and $\gamma^2$ for Aa and AA subjects, respectively--one can assume a closely linked diallelic marker (recombination fraction = 0) with alleles B and b of respective frequencies $m$ and $1 - m$. The coefficient of linkage disequilibrium, $\delta$, is defined as $\mathrm{freq}(AB) - pm$, and the maximum value of $\delta$, $\delta_{max}$, is reached when freq(AB) equals the lower of the two frequencies $m$ and $p$. The probability that a heterozygous Bb subject carries A in coupling with B is $\pi_1 = p + \delta/m$, and the probability that the same subject carries A in coupling with b is $\pi_2 = p - \delta/(1 - m)$ (2). In a sample of single affected individuals with their parents, the probability for a Bb parent to transmit B to his affected child is $P(\text{tr } B) = [1 + (\gamma - 1)\pi_1]/[2 + (\gamma - 1)(\pi_1 + \pi_2)]$ (3). The situation described by Risch and Merikangas corresponds to complete linkage disequilibrium, that is, $\delta = \delta_{max}$ with $m = p$, with $P(\text{tr } B)$ reducing to $\gamma/(1 + \gamma)$. In other cases, the number of necessary families increases dramatically as $p$ differs from $m$ even when $\delta = \delta_{max}$, and also as $\delta$ decreases. Thus, the power of association studies such as the TDT can be quite strong when there is a high probability that the allele studied is the causal allele, as shown by Risch and Merikangas. In other cases, researchers should be aware that the power of such association studies can be greatly diminished as soon as the ratio $m/p$ departs from unity and the linkage disequilibrium becomes weaker.
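The transmission probability described above can be evaluated directly. The sketch below is an illustration only: the names `gamma`, `delta`, `pi1`, and `pi2` are assumed labels for the relative risk, the disequilibrium coefficient, and the two coupling probabilities, and the parameter values in the example loop are hypothetical. It shows P(tr B) falling from the complete-disequilibrium value toward 0.5, the value expected under no association, as the marker frequency m moves away from p.

```python
def delta_max(p, m):
    """Largest disequilibrium coefficient, reached when freq(AB) = min(p, m)."""
    return min(p, m) - p * m


def coupling_probabilities(p, m, delta):
    """Probabilities that A is in coupling with B (pi1) and with b (pi2)."""
    pi1 = p + delta / m
    pi2 = p - delta / (1.0 - m)
    return pi1, pi2


def p_transmit_b(gamma, p, m, delta):
    """Probability that a Bb parent transmits B to an affected child."""
    pi1, pi2 = coupling_probabilities(p, m, delta)
    return (1.0 + (gamma - 1.0) * pi1) / (2.0 + (gamma - 1.0) * (pi1 + pi2))


if __name__ == "__main__":
    gamma, p = 2.0, 0.1  # hypothetical relative risk and disease allele frequency
    for m in (0.1, 0.3, 0.5):
        d = delta_max(p, m)
        print(f"m = {m:.1f}, delta = {d:.3f}, P(tr B) = {p_transmit_b(gamma, p, m, d):.4f}")
```

With m = p and maximal disequilibrium the function returns gamma/(1 + gamma), the case analyzed by Risch and Merikangas; with delta = 0 it returns exactly 0.5.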
Bertram Müller-Myhsok
Department of Molecular Genetics,
Bernhard Nocht Institute for Tropical Medicine,
D-20359 Hamburg, Germany
Laurent Abel
Institut National de la Santé et de la Recherche Médicale (INSERM) U.436,
Mathematical and Statistical Modeling in Biology and Medicine,
Hôpital Pitié-Salpêtrière,
F-75013 Paris, France
2. $\pi_1 = p(A/B) = p(AB)/p(B)$. Given that $p(AB)$ is $\delta + pm$ and $p(B) = m$, we find $\pi_1 = \delta/m + p$. The value for $\pi_2$ is obtained in an analogous fashion.
3. $P(\text{tr } B) = p(\text{aff}/B)/[p(\text{aff}/B) + p(\text{aff}/b)]$, since the prior probabilities of transmitting B and b are both equal to 0.5. $p(\text{aff}/B)$ is $[\gamma\pi_1 + (1 - \pi_1)]D$, and $p(\text{aff}/b)$ is $[\gamma\pi_2 + (1 - \pi_2)]D$, where D is the probability that a subject is affected given he carries allele a. Thus, after some algebra, $P(\text{tr } B) = [1 + (\gamma - 1)\pi_1]/[2 + (\gamma - 1)(\pi_1 + \pi_2)]$.
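The algebra left implicit in this note is short. As a sketch, writing $\gamma$ for the heterozygote relative risk and $\pi_1$, $\pi_2$ for the two coupling probabilities (symbol names assumed here for illustration), the common factor D cancels and each bracket simplifies to $1 + (\gamma - 1)\pi_i$:

```latex
\begin{align*}
P(\text{tr } B)
  &= \frac{[\gamma\pi_1 + (1-\pi_1)]\,D}
          {[\gamma\pi_1 + (1-\pi_1)]\,D + [\gamma\pi_2 + (1-\pi_2)]\,D} \\
  &= \frac{1 + (\gamma-1)\pi_1}
          {[1 + (\gamma-1)\pi_1] + [1 + (\gamma-1)\pi_2]} \\
  &= \frac{1 + (\gamma-1)\pi_1}{2 + (\gamma-1)(\pi_1+\pi_2)}.
\end{align*}
```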
Response: We agree with Scott et al. that linkage analysis will be able to identify genes of major, but not genes of modest, effect. As such, we also agree that linkage analysis should not be arbitrarily abandoned, because undoubtedly it will lead to the discovery of some important disease susceptibility genes. However, we do not agree that linkage analysis can detect most genes underlying complex diseases, and we anticipate that few genes for complex disorders will be identified in this fashion.
As indicated by Scott et al., one measure of the total genetic effect for a complex disease is $\lambda_s$, the sibling risk ratio (1). However, it is generally impossible to determine the number of loci contributing to that total; if the number is large, even for a large value of $\lambda_s$, then none of the loci may be easily detected by linkage analysis.
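A rough calculation illustrates how quickly a large total effect can be diluted across loci. The sketch below rests on two assumptions that are not stated in the text: that loci combine multiplicatively, so the overall sibling risk ratio is the product of locus-specific ratios, and that every locus contributes equally. The function name and the total value of 15 are hypothetical.

```python
def per_locus_lambda(total_lambda_s, n_loci):
    """Locus-specific sibling risk ratio when n_loci equal loci multiply to the total.

    Assumes a multiplicative model with equal contributions (illustrative only).
    """
    return total_lambda_s ** (1.0 / n_loci)


if __name__ == "__main__":
    total = 15.0  # hypothetical overall sibling risk ratio
    for k in (1, 2, 5, 10):
        print(f"{k:2d} loci -> per-locus ratio {per_locus_lambda(total, k):.2f}")
```

Under these assumptions, ten equal loci would each carry a ratio well below 2 even though the total is large.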
We showed in our Perspective (2) that loci which confer a genotypic relative risk $\gamma$ less than 4 would be difficult or impossible to identify with current linkage strategies. The numbers of sib pairs required to detect linkage that were given in the table in our Perspective (2, p. 1516) were actually underestimates, for two reasons: (i) There was an error in the computer program producing the required number of sib pairs for linkage; the actual numbers are approximately 50% larger than given (3); and (ii) the numbers given correspond to the ideal case of completely informative markers and no recombination. Allowing for more realistic circumstances of reduced marker informativity and moderate recombination, the corrected numbers probably would be about two to three times larger than given in the table. Thus, while it is still possible to detect a locus with $\gamma$ of 4 or greater in a large family collection (say 500 or more), loci with smaller values of $\gamma$ are unlikely to be detected.
How many loci are likely to exist for complex diseases with $\gamma > 4$? While it is difficult to know beforehand, animal models might offer a clue. As an example, the non-obese diabetic (NOD) mouse provides a useful model for human insulin dependent diabetes mellitus in being genetically complex, having an autoimmune etiology, and in the importance of the major histocompatibility loci. However, backcross experiments have shown that at least 10 other loci are probably involved in susceptibility, and only one of these loci had a value of $\gamma$ greater than 4, with the rest in the range of 2 or less (4). We also note that animal backcross experiments are more analogous to human association studies than linkage studies, and this is why they have been more successful in identifying susceptibility loci than human linkage studies.
As indicated by Scott et al., multiple sclerosis is a complex disease with a presumed substantial genetic component (5). However, three recently published genome screens (6) of moderate size did not produce clear and replicable evidence of linkage in any chromosomal region. This lack of susceptibility loci of large effect in this disease suggests that a very large number of families may be required to detect linkage.
The discovery of apoE as a major risk factor for late onset
Alzheimer's disease is surely one of the major success stories of
modern human genetics. Thus, it is important to evaluate the means by
which this discovery was made. As indicated by Scott et al.,
it has been estimated that apoE confers a $\lambda_s$ value of around 2, with some
modification for age of onset (7). Thus, in theory, this
locus would be identifiable by linkage analysis with a sufficient
number of sib pairs (several hundred minimum). In fact, the initial
linkage observation on chromosome 19 (8), which produced a
lod score of 4, was based on an analysis with markers that were likely
to be in linkage disequilibrium with apoE. Performing linkage analysis
with a marker associated with disease leads to an increase in the lod
score (9). Similar linkage analysis with a nearby marker
with little or no linkage disequilibrium (for example, the apo CII
microsatellite) in the same material does not produce significant
evidence for linkage (8, 10). Thus, in reality, the
"linkage" discovery on chromosome 19 was actually based on an
association between marker loci and the disease.
We agree with Scott et al. that genome-wide association studies will be based on future rather than current technology (as indicated in our title), and for the present we are still limited to the technology that exists. Although we agree that linkage studies should continue to be pursued, we also believe that this approach will produce only a modest number of loci for complex diseases.
We agree with Bell and Taylor that candidate genes are best tested in the framework of a biological hypothesis, often involving an interaction with a predisposing environmental agent, and the examples they provide are illuminating [for others, see (11)]. Also, as they point out, classic epidemiologic study designs, such as case-control or cohort, are excellent for testing such gene-environment interaction effects. The primary drawback from such designs for detecting genetic effects, however, is the potential for confounding, leading to an incorrect inference of causality for an observed association (12). Specifically, consider a population that has ethnic stratification and a tendency toward endogamy within strata. Further suppose these strata differ both in disease prevalence and allele frequencies at an unrelated locus. When performing a case-control study from such an admixed population, if the cases and controls are unbalanced for these strata, an allele frequency difference between cases and controls may emerge which is artifactual and not causal. The solution, of course, is to precisely match the cases and controls according to these strata, or to perform a stratified analysis; such would be possible with the major ethnic groups such as exist in the United States. However, further strata are likely to exist within the major ethnic groupings (for example, European subgroups of Caucasians) for which matching and stratification might generally be quite difficult. Of course, this problem disappears in a completely randomly mating population.
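The stratification artifact described above can be reproduced with simple arithmetic. The sketch below is illustrative only: the function name and all numbers are hypothetical, and it assumes that within each stratum the allele is unrelated to disease and that cases and controls are sampled from the whole population without matching. Even so, the pooled allele frequencies of cases and controls differ.

```python
def case_control_allele_freqs(strata):
    """Pooled allele frequency among cases and among controls.

    strata: list of (population_share, disease_prevalence, allele_frequency).
    Within each stratum the allele is assumed independent of disease status.
    """
    case_weight = sum(w * k for w, k, _ in strata)
    control_weight = sum(w * (1.0 - k) for w, k, _ in strata)
    case_freq = sum(w * k * f for w, k, f in strata) / case_weight
    control_freq = sum(w * (1.0 - k) * f for w, k, f in strata) / control_weight
    return case_freq, control_freq


if __name__ == "__main__":
    # Two hypothetical strata that differ in both prevalence and allele frequency.
    strata = [(0.5, 0.10, 0.6), (0.5, 0.02, 0.2)]
    cases, controls = case_control_allele_freqs(strata)
    print(f"allele frequency in cases:    {cases:.3f}")
    print(f"allele frequency in controls: {controls:.3f}")
```

With a single stratum, or with strata that share the same prevalence or the same allele frequency, the two pooled frequencies coincide, which matches the remark that the problem disappears under random mating.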
This problem can also be solved by resorting to family-based association tests, such as the TDT we used in our analysis. This test has been shown to be immune to confounding due to population stratification (13). Also, in the absence of population stratification, this test has similar power to the usual case-control design (14). Furthermore, cases or families (or both) can also be classified according to a relevant environmental exposure and allelic transmission compared across these classes to search for gene-environment interactions. We also showed that unless the disease predisposing allele frequency is high, families with more than one affected child can be substantially more powerful than singletons, although they are also likely to be more difficult to find.
Because of the potential problem of genetic stratification, the optimal design for searching for genes of modest effect, especially in the absence of a clear biological model, is the family-based design, such as singleton or multiple affected sibs with parents. For early onset diseases, such samples should not be difficult to obtain, and are likely worth the potential additional cost. We would add that precise ethnic matching in a case-control paradigm can also lead to increased expense, if achievable at all. In the situation of late onset diseases, where parents are usually unavailable, an alternative design is discordant sib pairs, where effectively an unaffected sib serves as a control for the affected sib. This design also protects against genetic stratification artifact, but may lead to somewhat reduced power because of the genetic correlation between sibs (14).
Long et al. suggest a prospective study design where a random population sample is subsequently followed for development of disease. Presumably, at initiation, everyone in the study is genotyped for a large number of loci. They show that if the disease is sufficiently common, reasonable power is obtained by contrasting the allele frequencies in those who develop the disease with those that do not. The primary benefit from this approach is that multiple diseases can be studied using the same population of subjects, again provided the diseases are sufficiently common. It would appear that a minimum frequency of 10% is required to obtain plausible sample sizes for sufficient power.
There are also several drawbacks to this approach. First, as for the typical epidemiologic paradigms, such as case-control studies, there is the problem of population substructure as we have described (in our response to Bell and Taylor) and also mentioned by Long et al. Second, with this approach, sample pooling is not possible, because it is unknown a priori which individuals will become affected. Thus, this approach requires construction of individual genotypes, which can greatly magnify the technical effort. By contrast, for a typical case-control design, two pools can be formed--one for affected individuals, another for those unaffected, and overall allele frequencies within the two groups determined. Thus, for a study of n cases and n controls and t loci, genotypes for only 2t samples need to be determined as opposed to 2nt samples (15). The same efficiency may obtain for a family-based design, such as affected individuals and their parents, where those affected are pooled and contrasted to the pooled group of parents. While this approach cannot give the precise data needed for a TDT analysis, it still provides a robust, powerful, and efficient means for initial screening; any positive loci can subsequently be subjected to individual genotyping (14).
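The saving from pooling follows directly from the counts given above. As a small sketch (the function name and the sample size of 500 are hypothetical; the figure of 1,000,000 loci is the number of tested alleles mentioned elsewhere in this response):

```python
def assay_counts(n, t):
    """Samples to type for n cases and n controls at t loci.

    Returns (pooled, individual): two pools per locus versus every subject per locus.
    """
    pooled = 2 * t
    individual = 2 * n * t
    return pooled, individual


if __name__ == "__main__":
    pooled, individual = assay_counts(n=500, t=1_000_000)
    print(f"pooled design:     {pooled:,} samples")
    print(f"individual design: {individual:,} samples")
    print(f"ratio:             {individual // pooled}x")
```

The ratio between the two designs is simply n, so the advantage of pooling grows with the number of subjects.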
The approach of Long et al. would not be practical for rare diseases, for example, those with a population frequency less than 5%. However, a compromise is possible. Numerous studies already exist that sample affected individuals, with parents or unaffected sibs, for a variety of diseases. The subjects from these studies can be followed for a variety of other diseases and then subjected to analysis as they develop these other, more frequent diseases. Pooling across studies could then provide sufficient material.
As indicated by Müller-Myhsok and Abel, our analysis was based on association studies where the actual disease predisposing polymorphism is in hand. This is why we incorporated such a large number of tested alleles (1,000,000). We also indicated that the number of loci to be tested might be substantially reducible if one allows for linkage disequilibrium. However, as pointed out by Müller-Myhsok and Abel, depending on linkage disequilibrium is not without risk. The power of the association test can decline dramatically as linkage disequilibrium diminishes or if the tested allele has a substantially different frequency than the disease allele. To a large extent, the expectation with regard to linkage disequilibrium across the genome is uncharted territory, and thus it is difficult to predict the power of using a less dense map at this point in time. However, we can present two cases that provide some degree of optimism. The first pertains to apoE and late onset Alzheimer's disease. Several polymorphisms in the apoE region show strong linkage disequilibrium and comparable allele frequencies, allowing association to be readily detected with other neighboring polymorphisms (16). A second example is the insulin VNTR region of chromosome 11p. Several polymorphisms in this region have been identified showing strong disequilibrium and similar allele frequencies, leading to comparable degrees of association with disease (17).
As genome-wide linkage studies are supplanted by genome-wide association studies, and the distribution of linkage disequilibrium across chromosomes and populations is further explored, the degree to which linkage disequilibrium as opposed to direct causality can be utilized to locate disease susceptibility loci in the genome will become more apparent.
Neil Risch
Department of Genetics,
Stanford University School of Medicine,
Stanford, CA 94305-5120, USA
E-mail: risch{at}lahmed.stanford.edu
Kathleen Merikangas
Department of Epidemiology,
Yale University School of Medicine,
New Haven, CT 06510, USA
E-mail: kath{at}zeus.psych.yale.edu
3. […] was inadvertently fixed at 0. Because in most calculations the value of […] was close to 1, the correct values are approximately 1.5 times the values given in the table.
Science. ISSN 0036-8075 (print), 1095-9203 (online)