Introduction

The development and progression of common diseases apparently occur through interactions of multiple factors that include those for life-style, environment, aging, and genes. One popular theory for the genetic basis of common disease susceptibility is the “common disease–common variant” hypothesis (Wright and Hastie 2001), which predicts that an association study can identify multiple genes with risk-modifying common alleles, each with low-to-moderate genetic relative risks. Unveiling such relatively common but “weak” disease genes may lead to at least three important developments in medicine in the postsequencing era: First, a high-population-attributable risk may aid in the identification of a high-risk population, which would benefit from specific preventive interventions, such as chemoprevention or a life-style modification clinic. Second, once all or a majority of the risk-modifying genes and alleles are disclosed, the disease risk for each individual can be evaluated with good accuracy (Pharoah et al. 2002). Third, but most importantly, identification of disease-associated genes should lead us to the uncovering of the molecular pathogenesis of a disease, which is the ultimate basis for development of targeted therapy, diagnosis, and prevention.

Even though the human genome sequencing per se was declared over, about 40% of the identified or predicted genes are tagged with no known functions (Venter et al. 2001; Okazaki et al. 2002), and it is obvious that our knowledge of the remaining “annotated” genes is far less complete. The realization of this serious dearth of knowledge has prompted many scientists to advocate a genome-wide screen for disease-gene hunting in addition to the candidate gene approach, which relies upon an a priori selection of small numbers of candidate genes based on the existing information or hypothesis.

Two crucial fundamentals have been developed for the genome screen in Japan: First, a very high-throughput, low-cost, and DNA-saving single nucleotide polymorphism (SNP) typing platform (Ohnishi et al. 2001); and second, a high-quality and comprehensive gene-centric SNP database with allele frequency information among the Japanese population, designated JSNP (Haga et al. 2002; Hirakawa et al. 2002). The combined power of the two fundamentals crystallized in the first success in the whole-genome association study led by RIKEN, identifying functional SNPs that are associated with susceptibility to myocardial infarction (Ozaki et al. 2002), under the auspices of the government-sponsored Millennium Genome Project of Japan. In 2001, another whole-genome association study based on JSNP was launched within the Millennium Genome Project to identify genes associated with Alzheimer’s, gastric cancer, diabetes, hypertension, and asthma (Yoshida 2002; Yoshida and Yoshimura 2003). This five-disease, joint, whole-genome association study has been organized as a collaboration of many institutions in Japan, including RIKEN, which provided the high-throughput typing technology (Ohnishi et al. 2001) and a significant portion of the typing itself. The number of SNPs analyzed by the first-stage screening in the whole-genome association study was about 100,000 per person, and a total of 940 patients (188 patients per disease) were genotyped. In the five-disease, whole-genome association study, 188 patients (cases) and 752 patients with other diseases (controls) were compared for each disease calculating sample odds ratios (ORs) for each SNP. In general, a common disease is expected to have more than ten genes with a population OR of 1.5–2.0 (Wright et al. 1999; Pharoah et al. 2002; Ponder 2001).

When 100,000 SNPs are analyzed in the first-stage screening, numerous false-positive results ensue due to the multiplicity of choices. Following the selection of a certain number of SNPs (e.g., 2,000) based on sample ORs in the first screening, one or more succeeding stages of screening are obviously required. However, for the succeeding stage experiments, the number of typings that can be performed is limited by the available resources, such as research funds, project period, and amount of DNA. The objective of the present study is to find the best design for the second and later stages of the whole-genome association study under the given fixed amount of total SNP typings. In this particular whole-genome association study, we assume a total of 3,008,000 typings (2,000 SNPs × 1,504 people, comprising 752 cases and 752 controls) for each disease.

Satagopan et al. (2002) compared the one-stage design with the two-stage design and showed that if the total number of typings is fixed, the probability for identifying disease-associated SNPs was maximized by a two-stage design. They concluded that 75% of the total cost should be used in the first stage to select the top 10% of SNPs as candidates for the second-stage screening, in which the remaining budget should be spent to select SNPs with a large value of test statistics to maximize the detection rate. However, their conclusion does not give a direct answer to our specific questions for the following reasons: first, Satagopan et al. (2002) compared one- versus two-stage designs for the first-time association study. They assumed a fixed number of candidate SNPs to be analyzed but varying numbers of subjects for the two designs. However, we need to choose one or two additional stages following the finished first stage of the genome screen (i.e., total two stages versus three stages, respectively). Because our “candidate” SNPs are selected from the first-stage experiment on the 100,000 SNPs, we can assign different numbers of SNPs to the two designs. On the other hand, unlike Satagopan et al. (2002), the total number of the subjects is fixed in our case (752 cases and 752 controls) because the subjects ascertained are the most important asset in our analysis. Second, their model assumes test statistics to be approximated to a normal distribution. The most conceivable statistic for such an approximation in a case-control study is a chi-square statistic, as mentioned in their paper, but we would like to evaluate possible designs without restricting the test statistics to the chi-square; for instance, other options include direct comparison of a sample OR and its P value, or P value calculated by Fisher’s exact test. Third, they used power PK, or probability to detect all of the true markers, to judge the performance of the different designs. By contrast, we introduced hit rate (Rh), which we believe to be more practical in the search for the genes implicated in polygenic complex traits.

In this study, to choose the optimal design for a whole-genome association study, a Monte Carlo simulation was undertaken to compare two- versus three-stage designs, which share a common 100,000 genome screen as their first stage. Of the five target diseases in the project, we examined gastric cancer as a representative in this study.

Materials and methods

Two-screening designs

Two- and three-stage whole-genome association studies are defined as follows in this study (Table 1).

Table 1 Number of SNPs and subjects (number of cases, controls) analyzed in this whole-genome association study

Two-stage design (A)

A-1 (first stage)

In 940 patients (188 gastric cancer cases and 752 people with four other diseases combined as “controls”), 100,000 SNPs per person were typed to determine a sample OR for each SNP to select a candidate SNP (n1 SNP) in descending order of magnitude of sample ORs.

A-2 (second stage)

A separate panel of 752 gastric cancer cases and 752 controls (individuals without gastric cancer) were chosen, and n1 SNPs per person were genotyped and sample ORs were calculated. SNPs with a sample OR above a cutoff value, c, are selected as disease-associated SNPs.

Three-stage design (B)

B-1 (first stage)

The experiment was conducted in the same manner as the two-stage design to select candidate SNPs with high sample ORs (n2 SNPs).

B-2 (second stage)

The second panel of 376 gastric cancer cases and 376 nongastric cancer controls were chosen, and the sample OR for the n2 SNPs was calculated to further narrow down the number of candidate SNPs to n3.

B-3 (third stage)

In the third panel of 376 cases and 376 controls, the n3 SNPs were genotyped, and the sample OR was calculated by combining data collected in steps B-3 and B-2 to identify the gastric-cancer-associated SNPs with sample ORs above the cutoff value, c.

Framework of simulation experiment

In the present study, a Monte Carlo simulation (Metropolis et al. 1953) experiment was used to compare the performance of the two designs described above. At each SNP, the association between its genotypes and disease status was tested by a 2×2 contingency table (Table 2). Here we assume an either dominant or recessive model, and for an SNP with allele A and allele a, the genotype AA combined with genotype Aa was compared with the genotype aa. An OR was defined as:

$$ \psi = \frac{{n_{11} \times n_{22} }} {{n_{12} \times n_{21} }}. $$
(1)

When each cell in Table 2was large, log ψ approximately had a normal distribution with true log ψ (mean) and \(\sqrt {\frac{1} {{n_{11} }} + \frac{1} {{n_{12} }} + \frac{1} {{n_{21} }} + \frac{1} {{n_{22} }}} \) (standard deviation) (Agresti 1990). The population OR for the gastric-cancer-associated SNPs was ψ (ψ>1.0), while the population OR for SNPs unrelated to the disease was 1.0.

Table 2 Number of genotypes for cases and controls

The number of the true SNPs (SNPs truly associated with the disease) was designated np, which is given as a simulation parameter. The sample log ORs of those np SNPs was generated separately from that of the non-disease-associated SNPs, the number of which is 100,000-np. We regarded SNPs experimentally positive when their sample ORs were above a cutoff value, another simulation parameter. The number of the experimentally positive SNPs was then designated N, which contains Np disease-associated (true-positive) SNPs out of the original np SNPs. The number of false-positive SNPs (non-disease-associated SNPs) was therefore NNp. While N and Np represent random variables controlled by experimental errors, np is a constant assigned as a simulation parameter.

As performance indicators for each design, two indicators defined by formulas 2 and 3, Rh (hit rate) and Rd (detection rate) were used. Hereinafter, these indicators will be expressed as percentages.

$$ R_{\text{h}} = \frac{{N_{\text{p}} }} {N} $$
(2)
$$ R_{\text{d}} = \frac{{N_{\text{p}} }} {{n_{\text{p}} }} $$
(3)

Rh is a proportion of true disease-associated SNPs among positive SNPs (positive predictive value) while Rd is the proportion of positively detected SNPs among true disease-associated SNPs (sensitivity). Ideally, both values should approach 100%, but since these two parameters are opposing in nature, an optimal balance between them must be assessed.

Algorithm of simulation experiment

The algorithm of the two-stage design, whole-genome association study is as follows:

A-1 (first stage)

  1. 1.

    We generated log ORs randomly for the np “true” markers (i.e., SNPs truly associated with the disease) in the first stage according to a normal distribution with mean log ψ and standard deviation \(\sqrt {\frac{1} {{94}} + \frac{1} {{94}} + \frac{1} {{376}} + \frac{1} {{376}}} .\) Here, np, the number of such true SNPs, was set at 100 based on inference from the literature. The population OR, ψ, was variably set from 1.3 to 1.9 with an increment of 0.2. The standard deviation was chosen as the minimum possible value, which corresponds to a disease-associated genotype frequency of 50% (see below in Design parameters).

  2. 2.

    We generated log ORs randomly for the (100,000-np) non-disease-associated SNPs according to a normal distribution with mean 0 and standard deviation \(\sqrt {\frac{1} {{94}} + \frac{1} {{94}} + \frac{1} {{376}} + \frac{1} {{376}}} .\)

  3. 3.

    The log ORs generated above were combined and sorted in descending order of the values and then were selected up to the n1th SNPs. Here, n1 was fixed at 2,000.

A-2 (second stage)

  1. 4.

    We generated log ORs randomly for the np’ true markers in the second stage according to a normal distribution with mean log ψ and standard deviation \(\sqrt {\frac{1} {{376}} + \frac{1} {{376}} + \frac{1} {{376}} + \frac{1} {{376}}} .\) Here, np’ is the number of SNPs that are truly associated with the disease and were chosen in the first-stage screening (the starting number of the true SNPs was np). The number of the subjects analyzed in the second stage was 752 cases and 752 controls, who are individuals different from those analyzed in the first stage.

  2. 5.

    We generated log ORs randomly for the n1np’ non-disease-associated SNPs according to a normal distribution, with mean 0 and standard deviation \(\sqrt {\frac{1} {{376}} + \frac{1} {{376}} + \frac{1} {{376}} + \frac{1} {{376}}} .\)

  3. 6.

    The log ORs generated in (4) and (5) for the second-stage typing were combined and sorted in descending order. The SNPs were considered disease-associated when their ORs exceeded the cutoff value (c). The value of “c” was variably set from 1.0 to 2.5, with an incremental of 0.1.

  4. 7.

    We calculated the hit rate Rh and detection rate Rd, as defined above.

  5. 8.

    Steps (1)–(7) were repeated 10,000 times.

The algorithm of the three-stage design is essentially identical to the two-stage design except for parameters such as standard deviation and the number of SNPs analyzed at each stage; standard deviation of log ψ used in the simulation for the second and third stages in the three-stage design is \(\sqrt {\frac{1} {{188}} + \frac{1} {{188}} + \frac{1} {{188}} + \frac{1} {{188}}} .\)

Design parameters

The two designs in this whole-genome association study include constants that should be set as simulation parameters: c, n1, n2, and n3. The value of “c” is identical for both designs and is set from 1.0 to 2.5, with an incremental pitch of 0.1.

In this whole-genome association study, the first-stage screening was already finished, and n1 (the number of SNPs analyzed in the second stage of the two-stage design of the whole-genome association study) was fixed at 2,000. In the case of the three-stage design, the total 752 cases and 752 controls need to be divided into second- and third-stage typings. Because our typing system requires sample numbers to be equal to or a multiple of 188, we initially compared the following combinations of case or control number for the second stage/third stage: 188/564, 376/376, and 564/188. We chose the 376/376 combination because it gave the best Rh and Rd for the SNPs for ψ (population OR) ≥1.5 (Table 3). It follows, then. that n2 and n3were determined so that their sum equaled 4,000 (n2>n3) to keep the total typing cost the same between the two- and three-stage designs. In the three-stage design, n2, was set variably from 2,500 to 3,500, with an incremental pitch of 500, and n3was automatically set for each n2. Table 1 shows the relationship between the number of subjects and the number of SNPs for the two designs. The number of simulations was set at 10,000.

Table 3 Hit rates Rh (upper figures) and detection rates Rd (lower figures) for various combinations of the number of subjects in the three-stage designa

A possible effect of genotype frequencies on the simulation was first evaluated by changing the disease-associated genotype frequencies in the range of 10–50%. Because the two- and three-stage designs showed essentially the same relative pattern at each genotype frequency (Table 5), here we show the result of the simulation with the minimum standard deviation of a sample log OR, as described above in Algorithm of simulation experiment; the difference between the two designs should best be illustrated with the minimum standard deviation, which corresponds to the assumption of 50% disease-associated genotype frequencies.

Results and discussion

Comparison of the two designs

Table 4 shows Rh and Rd (%) when the cutoff value, c, is set at 1.6. Taking into account the number of simulations, digits after the second decimal point are considered irrelevant. From the perspective of comparing the two experimental designs, the column with ψ=1.7 is the most important, since Rd was about 60%. Under the three-stage design, when n2 is set at 2,000, Rh and Rd are almost identical to those of the two-stage design (n1=2,000). When n2 is increased, Rd is also elevated in the three-stage design, but the gain in Rd is modest—only about 3.8% higher than that of the two-stage design.

Table 4 Hit rates Rh (upper figures) and detection rates Rd (lower figures) in percent

Table 4 also shows the Rh and Rd at the second (mid) stage of the three-stage design. Comparing Rh of the second stage with that of the final third stage, it is noted that the third stage is necessary to suppress false positives in the three-stage design. On the other hand, when the population OR (ψ) is less than 1.5, Rd was actually more decreased in the third stage than in the second stage.

According to the study by Satagopan et al. (2002), the two-stage design is more advantageous than the one-stage design, but our analysis revealed only a marginal gain in Rd. Since sample ORs for SNPs unrelated to the disease did not exceed the cutoff value of 1.6, Rh was almost 100%; only 60% were identified in both designs (Figs. 1, 2). Moreover, the typing laboratory is burdened with the additional chore of having to divide the whole typing job into three stages instead of two, such as an increase in the complexity of the system and in the difficulty in finding the optimal combination of multiplex PCR. Overall, the limited extent of the predicted increase in Rd does not warrant the adoption of the more complicated three-stage design in the whole-genome association study in the Millennium Genome Project of Japan, which favors the simple two-stage scheme.

Fig. 1
figure 1

Hit rate Rh and detection rate Rd against cutoff value c for the two-stage design. Broken line Rh, solid line Rd

Fig. 2
figure 2

Hit rate Rh and detection rate Rd against cutoff value c for the three-stage design. Broken line Rh, solid line Rd

Although an accurate estimation of the number of disease-associated SNPs may not be possible, Sing et al. (1996) suggested that about 100 SNPs or about 30 genes determine susceptibility to coronary artery disease. Wright et al. (1999), Pharoah et al. (2002) and Ponder (2001) also documented the involvement of similar numbers of disease-associated genes in other common diseases. Setting the number of disease-associated SNPs at 100 was thus not completely without grounds. However, even when the number of disease-associated SNPs, np, is set at 10, 30, 50, or 70, simulations showed results similar to those using an np of 100 (data not shown). It should be pointed out, however, that the present study is based on simulation, and that it is also necessary to seek theoretically optimal conditions.

To assess the effect of genotype frequencies, the simulation was repeated with the genotype frequency varying from 10 to 50%. Although the frequencies influenced power significantly, the comparison between the two- and three-stage designs showed essentially the same relative pattern (Table 5), confirming that our final conclusion is not significantly affected by the difference in genotype frequency.

Table 5 Hit rates Rh (upper rows) and detection rates Rd (lower rows) in percent for different genotype frequencies (10–50%)

Selection of cutoff value

In Table 4, the cutoff value was fixed at 1.6. Figures 1 (the two-stage design) and 2 (the three-stage design, n2=3,500) show changes in Rh and Rd with changing cutoff values. The choice of c should primarily depend on the purpose of the genome screen or its role in the overall gene-hunting scheme; some investigators may request a higher positive predictive value (Rh) while sacrificing sensitivity (Rd), yet other researchers may choose the other way around. One simple way to deal with the trade-off between Rh and Rd may be to select c as intersection points of the Rh and Rd curves for each ψ value, i.e., (ψ: c)=(1.3:1.24), (1.5:1.26), (1.7:1.30), and (1.9:1.38). With these cutoff values, the degree of improvement in Rd for the three-stage design is about the same as that mentioned above for c=1.6. However, it is generally recommended that a cutoff value be set a little lower than ψ, and collecting information about population ORs of the disease-associated SNPs is clearly important.

Population OR for gastric cancer

Using the real experimental data of about 80,000 SNPs on gastric cancer from the first stages of the whole-genome association study, the frequency distribution of sample ORs was calculated and shown in Fig. 3. About 90% of the SNPs fall within the range of a sample OR from 0.6 to 1.7, suggesting that disease-associated SNPs within this range of a population OR would be difficult to detect by the sample size of the current project. In other words, the number of subjects needs to be increased to detect gastric-cancer-associated SNPs with population ORs smaller than 1.7 in this whole-genome association study. It was, therefore, reasonable to compare the performance of the screening designs under the condition in which the population OR, ψ, and cutoff value are set to about 1.7 and 1.6, respectively.

Fig. 3
figure 3

Observed distribution of genotype ORs for gastric cancer in the first stage of the whole-genome association study in the Millennium Genome Project of Japan

It should be noted, however, that our sample ORs were measured between individuals with gastric cancer and those with the other four diseases combined (“control”) in the present five-disease whole-genome association study. Thus, the control population for the first stage is different from that of the second- or third-stage screening in which the controls were specifically ascertained as those without gastric cancer. As a consequence, some ambiguity may exist in our assessment of the population ORs for common gastric cancer.

Issues to be addressed in future studies

In the present simulation, each SNP is assumed to be distributed independently among the subjects. However, this is not always the case because significant linkage disequilibrium (LD) is often noted over pairs of the SNPs. In the future, correction of the present study may be necessary, taking into account the LD structure of the relevant SNPs.