Introduction

Genome-wide association (GWA) is a popular technique for disease gene mapping of complex traits. The availability of microarrays has made GWA technically possible, but it remains prohibitively costly for many researchers. A cost-efficient alternative to individual genotyping is DNA pooling,1 an approach recently extended to use arrays.2, 3, 4 With array-based pooling, well-powered GWA studies can be conducted at vastly reduced cost, bringing them well within the reach of most laboratories.2 The primary factor that affects the efficiency of pooling compared with individual genotyping is the magnitude of the pooling variance. An appreciation of the sources of this variation is critical to the efficient allocation of resources in terms of the number of arrays and the number of pools used.

Previously, Macgregor et al2 presented pooling data using Affymetrix arrays but did not address the composition of the pooling variance. Here it is shown that by examining variation between and within pools, it is possible to partition the variation into a component attributable to error on the arrays (ie, ‘technical’ error) and a component owing to errors in pool construction. This demonstrates that most of the error in pooling is attributable to variation on the arrays and that the error introduced when pools are carefully constructed is of substantially less importance. For optimal efficiency, resources should be allocated to increasing the number of arrays per pool rather than to constructing multiple pools.

Materials and methods

Data

Full details of the data used are given elsewhere.2, 5 In brief, genomic DNA was extracted (using the same method throughout) from peripheral venous blood samples collected in the period 1997–2003. Two DNA pools (case and control) of 384 individuals were constructed by mixing equal amounts of adjusted DNA samples. Three Affymetrix Genechip HindIII arrays (56494 SNPs) were applied to each pool.

Statistical methods

Sources of error with pooling

With pooling there are a number of sources of error. The sample frequency estimate, â, from pooled data can be written (cf. appendix 1 in Macgregor et al2)

$$\hat{a} = a + e_{\mathrm{pool\_array}} + e_{\mathrm{pool\_construction}} = p_a + e_b + e_{\mathrm{pool\_array}} + e_{\mathrm{pool\_construction}}$$

where pa is the true population frequency, a is the frequency in that sample (this does not equal the true population frequency, pa, because of binomial sampling error), eb is the binomial sampling error, epool_array is the error associated with estimating the frequency from the pool on an array and epool_construction is the error associated with creating a pool.

Different estimates of pooling variance

Estimates of pooling variance using a single sample

There are two methods for estimating the array variance from a single sample. The first is the simplest to outline and applies directly to the case where there are two array measures from the same pool; the second is given subsequently. With case pool sample estimates ai (for controls replace a with u) on array i (i=1,2)

$$a_i = a + e_{\mathrm{pool\_construction}} + e_{\mathrm{pool\_array},i}$$

where a is the true frequency in that set of cases. The construction error is common to both arrays applied to the same pool and cancels in the difference a1−a2, so, with independent array errors of equal variance, the variance of the difference is

$$\mathrm{var}(a_1 - a_2) = 2\,\mathrm{var}(e_{\mathrm{pool\_array}})$$

and var(epool_array) is estimated using

$$\widehat{\mathrm{var}}(e_{\mathrm{pool\_array}}) = \frac{\mathrm{var}(a_1 - a_2)}{2}$$

where var(a1−a2) is obtained by calculating the average of the squared differences between a1 and a2 across the full set of SNPs on the array. var(epool_array) is assumed constant across SNPs. When there are more than two arrays, multiple pairings of array measures are possible and the best estimate of var(epool_array) is the average over all pairs.
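The pairwise estimator described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the analysis (which was carried out in R): the function names, the simulated frequencies and the standard deviations 0.035 and 0.013 are hypothetical, chosen only so that the simulated variances are of the same order as those reported in the Results.

```python
import itertools
import random


def estimate_array_variance(arrays):
    """Average, over all array pairs, of mean((a_i - a_j)^2) / 2 across SNPs."""
    pair_estimates = []
    for x, y in itertools.combinations(arrays, 2):
        mean_sq_diff = sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)
        pair_estimates.append(mean_sq_diff / 2.0)
    return sum(pair_estimates) / len(pair_estimates)


def simulate_pool(n_snps, n_arrays, sd_array, sd_construction, rng):
    """Hypothetical pool: one construction error per SNP shared by all arrays,
    plus an independent array error per measurement (values are not clipped)."""
    arrays = [[] for _ in range(n_arrays)]
    for _ in range(n_snps):
        true_freq = rng.uniform(0.1, 0.9)
        shared = true_freq + rng.gauss(0.0, sd_construction)
        for arr in arrays:
            arr.append(shared + rng.gauss(0.0, sd_array))
    return arrays


rng = random.Random(1)
sd_array = 0.035  # assumed value for illustration only
arrays = simulate_pool(20000, 3, sd_array=sd_array, sd_construction=0.013, rng=rng)
est = estimate_array_variance(arrays)
print(f"simulated var(e_pool_array) = {sd_array ** 2:.5f}, estimate = {est:.5f}")
```

Because the construction error is shared by all arrays on a pool, it drops out of every difference, so the estimate tracks the simulated array variance regardless of the construction variance chosen.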

An alternative method, which applies immediately to the case where there are more than two arrays per pool, is to fit an analysis of variance to the set of ai values. This second method gives similar results to the first method on the data used here (three arrays per pool).
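As a rough sketch of the analysis of variance approach, one can treat SNP as the grouping factor and take the pooled within-SNP mean square as the estimate of var(epool_array). The function names and the toy frequencies below are hypothetical. With complete data (every SNP measured on every array) this mean square coincides algebraically with the average over all array pairs; any differences in practice would presumably come from SNPs missing on some arrays after QC.

```python
import itertools


def within_snp_mean_square(arrays):
    """One-way ANOVA with SNP as the factor: pooled within-SNP mean square."""
    n_arrays = len(arrays)
    n_snps = len(arrays[0])
    ss_within = 0.0
    for snp in range(n_snps):
        values = [arr[snp] for arr in arrays]
        mean = sum(values) / n_arrays
        ss_within += sum((v - mean) ** 2 for v in values)
    return ss_within / (n_snps * (n_arrays - 1))


def pairwise_estimate(arrays):
    """Average over array pairs of mean((a_i - a_j)^2) / 2 across SNPs."""
    estimates = []
    for x, y in itertools.combinations(arrays, 2):
        estimates.append(sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x) / 2.0)
    return sum(estimates) / len(estimates)


# Hypothetical frequency estimates for four SNPs on three arrays of one pool.
toy_arrays = [
    [0.20, 0.51, 0.80, 0.35],
    [0.24, 0.47, 0.83, 0.31],
    [0.18, 0.50, 0.77, 0.36],
]
anova_est = within_snp_mean_square(toy_arrays)
pair_est = pairwise_estimate(toy_arrays)
print(anova_est, pair_est)
```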

In Macgregor et al2 the three arrays (per case or control pool) were taken together and a quality control (QC) step applied. This step discarded SNPs with <8 probe measurements available across the three arrays. Here the arrays are considered separately and a per-array QC step implemented; this involved discarding SNPs with <2 probe measurements on the array under study.

Estimates of pooling variance using cases and controls

Macgregor et al2 describe a method that estimates the pooling variance from the cases and controls (summarized in the appendix in the supplementary online material). Unlike the case described above for estimating the pooling variance using a single sample, when cases and controls are used there is an additional component of variation owing to random (binomial) sampling. This sampling is explicitly accounted for in the method described by Macgregor et al.2 In this case, the two possible sources of pooling error are confounded and it is only possible to estimate a single variance (containing both the array pooling variance and the pool construction variance); this is henceforth referred to as var(epool_total).

To allow a suitable comparison with the estimates of pooling variance from a single pool, var(epool_total) was estimated for each of the nine possible pairwise comparisons between the case and control pools (ie, case pool array 1 vs control pool array 1, case pool array 1 vs control pool array 2, …). The overall estimate of var(epool_total) was then obtained by averaging over all pairs. The same QC step that was applied to the single sample analysis was used. This estimate of var(epool_total) will not equal the pooling variance estimate reported in Macgregor et al2 (which used the same data as used here but calculated the pooling variance on all three arrays together). In that case the estimate was a compound of the array-specific error (which is three times smaller with three arrays than with one array) and the pool construction error (which is unaffected by the number of arrays). Furthermore, as above, a slightly different QC step was used when all three arrays were taken together.
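The statement that averaging three arrays divides the array-specific error by three, while leaving the construction error untouched, can be checked with a small simulation. This is a sketch under simplifying assumptions: the errors are drawn as independent normal variates, binomial sampling is ignored, and the two variance components are set to the values reported in the Results purely for illustration.

```python
import random
import statistics

rng = random.Random(7)
var_array = 0.00126         # var(e_pool_array), taken from the Results
var_construction = 0.00018  # var(e_pool_construction), taken from the Results

single_array_errors = []
three_array_errors = []
for _ in range(50000):
    # One construction error per pool and SNP, shared by all arrays on that pool.
    construction = rng.gauss(0.0, var_construction ** 0.5)
    array_errors = [rng.gauss(0.0, var_array ** 0.5) for _ in range(3)]
    single_array_errors.append(construction + array_errors[0])
    three_array_errors.append(construction + sum(array_errors) / 3)

var_single = statistics.pvariance(single_array_errors)
var_three = statistics.pvariance(three_array_errors)
print(f"one array:    {var_single:.5f} (expected {var_array + var_construction:.5f})")
print(f"three arrays: {var_three:.5f} (expected {var_array / 3 + var_construction:.5f})")
```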

Pooling construction variance estimates

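var(epool_construction) cannot be calculated directly from these data. However, as there are separate estimates of var(epool_array) (from single pools) and var(epool_total) (from case–control differences), var(epool_construction) can be estimated by subtraction

$$\mathrm{var}(e_{\mathrm{pool\_construction}}) = \mathrm{var}(e_{\mathrm{pool\_total}}) - \mathrm{var}(e_{\mathrm{pool\_array}})$$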

An alternative estimate of var(epool_construction) can also be calculated from the two possible estimates of var(epool_total). The first estimate, denoted var(epool_total_arrays_pairwise), from the average of the nine pairwise combinations given above, yields an estimate of var(epool_array)+var(epool_construction). The second estimate, denoted var(epool_total_3_arrays), from the three case pool arrays together vs the three control pool arrays together, yields an estimate of var(epool_array)/3+var(epool_construction) (this was what was calculated in Macgregor et al2). Solving these two equations for var(epool_construction) yields

$$\mathrm{var}(e_{\mathrm{pool\_construction}}) = \frac{3\,\mathrm{var}(e_{\mathrm{pool\_total\_3\_arrays}}) - \mathrm{var}(e_{\mathrm{pool\_total\_arrays\_pairwise}})}{2}$$

Calculations were carried out using R.6

Results

The estimates of var(epool_array) were 0.00118 and 0.00133 for control and case pools, respectively. The overall estimate of var(epool_array) over both pools was 0.00126. The estimate of var(epool_total) was 0.00144 (average over all nine possible pairs). Subtracting the estimate of var(epool_array) from var(epool_total) gives an estimate of var(epool_construction) of 0.00018. In terms of variance explained, this suggests that only 12.5% of the variance in pooling is due to pool construction; the remaining 87.5% is due to array variation.
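The arithmetic behind these figures is simple enough to reproduce directly. The short Python check below uses only the estimates quoted in this paragraph; the variable names are illustrative.

```python
var_array = 0.00126  # overall estimate of var(e_pool_array) over both pools
var_total = 0.00144  # var(e_pool_total), average over the nine case-control pairs

# Construction variance by subtraction, then each component's share of the total.
var_construction = var_total - var_array
share_construction = var_construction / var_total
share_array = var_array / var_total

print(f"var(e_pool_construction) = {var_construction:.5f}")
print(f"construction share = {share_construction:.1%}, array share = {share_array:.1%}")
```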

The pooling variance estimate from Macgregor et al2 was 0.00058, based on three arrays. By contrasting this estimate with the one obtained from the nine possible pairwise combinations of case–control, an alternative estimate of var(epool_construction) is 0.00015. A slightly different QC step was applied in this case, which may account for the slight difference between this estimate and the one in the previous paragraph.
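This alternative estimate follows from solving two linear equations in the two unknown variance components. A minimal sketch, assuming the pairwise estimate equals var(epool_array)+var(epool_construction) and the three-array estimate equals var(epool_array)/3+var(epool_construction); the function name split_variance is hypothetical.

```python
def split_variance(total_pairwise, total_n_arrays, n_arrays=3):
    """Solve  total_pairwise = A + C  and  total_n_arrays = A / n_arrays + C
    for the array variance A and the construction variance C."""
    var_array = (total_pairwise - total_n_arrays) * n_arrays / (n_arrays - 1)
    var_construction = total_pairwise - var_array
    return var_array, var_construction


implied_array, implied_construction = split_variance(0.00144, 0.00058)
print(f"implied var(e_pool_array)        = {implied_array:.5f}")
print(f"implied var(e_pool_construction) = {implied_construction:.5f}")
```

With these inputs the implied array variance is 0.00129, close to the 0.00126 obtained from the single-pool analysis.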

Discussion

The success of array-based pooling depends upon reducing the overall pooling error and the results here suggest that the majority of this error arises as a result of array-specific variability. To reduce the array-specific variance several arrays should be used per pool. Based on the variance seen in the data used here, up to seven Affymetrix arrays could have been used per pool before the pool construction variance would have become larger than the array-specific variance. In some previous array-based pooling studies,4, 7 smaller numbers of individuals (N=10–20) were placed in each pool. This contrasts with the large number (N=384) used here. The work presented here suggests that, as the pooling error is largely array-specific error, using larger numbers of arrays on smaller numbers of pools (with more individuals per pool) will be more effective than smaller numbers of arrays on larger numbers of pools. As discussed in Macgregor et al,2 the overall optimal study design will vary depending on the size of the overall pooling variance relative to the binomial sampling variance.
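The figure of seven arrays can be recovered from the Results under the assumption that averaging n arrays on a pool reduces the array-specific variance to var(epool_array)/n while leaving var(epool_construction) unchanged. The sketch below uses the estimates 0.00126 and 0.00018; the function name pooling_variance is illustrative.

```python
var_array = 0.00126         # var(e_pool_array) from the Results
var_construction = 0.00018  # var(e_pool_construction) from the Results


def pooling_variance(n_arrays):
    """Total pooling variance for one pool measured on n_arrays arrays."""
    return var_array / n_arrays + var_construction


# Number of arrays at which the averaged array variance equals the construction variance.
break_even = var_array / var_construction
print(f"break-even number of arrays per pool: {break_even:.1f}")

for n in range(1, 11):
    print(f"{n:2d} arrays: array part = {var_array / n:.5f}, total = {pooling_variance(n):.5f}")
```

Beyond the break-even point, further arrays on the same pool yield diminishing returns because the construction component then dominates the total.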

The estimates of var(epool_construction) were relatively small but replication of this result in other pools will be important. For the experiment described here, pools were carefully constructed following estimation of DNA concentrations in a step down procedure to achieve final DNA concentrations of 25 ng/μl (±0.55) before pooling.5 It is difficult to know from a single data set how much variability there will be in the estimate of var(epool_construction) and the overall levels of pooling construction variance will likely vary across laboratories. As the estimate of var(epool_construction) calculated here was based on a limited number of arrays, the confidence interval on the estimate of var(epool_construction) may not be particularly narrow.

In the above analysis the focus was on array variation being the source of technical variation. There are a number of technical steps necessary to produce data from pools and it is likely that both PCR variation and hybridization variation contribute to the overall technical variation. An experiment that recycled the reaction product for multiple hybridizations would allow the technical variation to be partitioned into these components.

A number of assumptions were made in the analysis (see also Macgregor et al2 for further coverage). Firstly, all SNPs were assumed to be unassociated with disease; this will hold for virtually all SNPs. Secondly, the pooling variance was assumed to be constant across SNPs on the array; no strong evidence was found for systematic variation, particularly for SNPs with allele frequencies in the range of primary interest (0.1–0.9). Finally, unequal amplification of alleles was assumed to not affect results; the focus was on the difference in allele frequencies (between case/control or between arrays 1 and 2 on a given pool, and so on) so this is unlikely to be an issue.