Introduction

Complex diseases often involve the interplay of many genes and environmental factors. The approach of genome-wide association study (GWAS) has shown great potential in dealing with such complexity, as demonstrated by the many exciting recent findings.1 By scanning hundreds of thousands of single-nucleotide polymorphisms (SNPs), GWAS is less biased than candidate gene approach and more powerful than traditional linkage analysis in detecting genetic variants that confer modest disease risk. It provides new opportunities to evaluate a group of genes as a whole for potentially important synergetic effects. However, most published GWAS studies focus on marginal effects of individual SNPs. Seldom has the interplay of multiple genes been carefully evaluated, largely because of prohibitive computational burden and a lack of appropriate tools. In this study, we present a gene-set-based approach for evaluating the collective action of multiple SNPs in many genes with companion software that aims at reduced computational burden by taking advantage of known genetic pathways.

In gene expression analysis, the method of gene-set enrichment analysis (GSEA),2 was proposed to overcome similar problems of prohibitive computation and lack of modeling for interactions. To measure the degree of ‘enrichment’ of association signals in a gene set, GSEA uses ‘enrichment score’ to summarize the association test scores of every gene in the gene set. If a gene set is over-represented by functional genes that are relevant to the disease of interest, GSEA will have enhanced power by combining strength of multiple genes. Indeed, the gene set identified by early GSEA application3 led to subsequent functional validation of connections between impaired mitochondrial activity and insulin resistance.4, 5 The approach may be extended to GWAS studies. However, a key modification is needed for a scoring method that can combine association signals of multiple SNPs to obtain a single value representing the importance of that gene to the disease trait. Once such a ‘gene score’ is obtained, enrichment of association in gene sets can be tested as before.

In GWASs, the number of SNPs in a gene varies from a handful to hundreds, with only one or a few responsible for functional divergence, or in linkage disequilibrium (LD) with causal variants. Ideally, only such SNPs should be used to represent the gene. Therefore, a gene score may be defined by the maximum test statistics among all SNPs in a gene.6 But genes with more genotyped SNPs or weak LD among them may score higher this way just by chance. A partial remedy for such a bias is to normalize gene-set enrichment scores. For example, the original GSEA method did so by scaling using the mean score estimated from permutation tests and by using the method described by Wang et al6 by standardization.

An alternative approach is to perform normalization at the gene-score level, so that biases due to varying gene size and LD structure are minimized before deriving gene-set enrichment scores. We present several methods for normalizing gene scores and their utility in extending GSEA to GWAS. The new approach can evaluate enrichment of associated variables in predefined sets of genes, SNPs, and is termed ‘variable-set enrichment analysis’ (VSEA). An R software package (also named VSEA) is developed to implement the new methods, with a collection of gene sets constructed from known biological pathways to facilitate real GWAS analysis. We present evaluations of the new methods by simulation and analysis of real GWAS data, and discuss how VSEA is related to other existing gene-set-based methods and their suitable applications in different scenarios.

Methods

The GSEA Approach

Subramanian's GSEA method

This method was proposed in gene expression analysis to detect significant ‘enrichment’ of disease association in a gene set. Let be the list of all genes measured on a gene expression microarray. Sorting their association test statistic values (for example, t-test score) from largest to smallest, we get and a ranked gene list . Let be the gene-set of interest. The enrichment in S is evaluated in three steps:

Calculate the enrichment score (ES): A Kolmogorov–Smirnov statistic is used to measure over-representation of S at the top of the ranked list L. It is calculated by examining the ranked gene list L, increasing a running sum statistic when encountering a gene in S and by decreasing it when not. The enrichment score is defined as the maximum of the running sum statistics:

where D denotes observed data, S the number of genes in S, , and P a tuning parameter giving higher weights to highly associated genes (P=1 was recommended in gene expression studies2).

Evaluate the significance of the enrichment score by permutation: ES is repeatedly calculated after shuffling disease status of samples, which generates an empirical null distribution of ES. An empirical P-value is estimated by the proportion of permutations that result in larger ES than originally observed.

Correct for multiple testing: When a large number of gene sets are tested simultaneously, false discovery rate (FDR) or family-wise error rate (FWER) may be used to correct for multiple testing. The enrichment scores are normalized first to minimize undesirable effects of varying gene-set size and within-gene-set correlations. Finally, the normalized ES (NES) is defined as:

where ES(S, π) is the enrichment score for permutation π.

Extension of GSEA to GWAS

A key step in extending GSEA to GWAS studies is to derive a summary score that combines signals from individual SNPs in each gene.6 Denote the SNPs of gene Gk as , and their association test statistic as . In Wang's extension of GSEA to GWAS,6 gene score for Gk is assigned the highest test statistic value among all the SNPs, max(tkj). Following this, the enrichment scores are calculated as in the original GSEA. To adjust for multiple testing, normalized ES is calculated as follows:

We note that a summary ‘gene score’ simply defined by the maximum of per-SNP statistic favors larger genes (more genotyped SNPs with weak LD) because they may score higher by chance as a result of the greater number of independent tests. Therefore, new methods are needed to correct such biases for improved performance of gene-set enrichment analysis in GWAS.

New methods using normalized gene scores

We propose to perform normalization of gene scores before calculating gene-set enrichment scores, making the gene scores comparable for genes of different sizes. Several alternatives are introduced below in addition to the max statistic used by Wang et al.6

Gene scores based on maximum SNP statistics

The first class of gene scores aims to normalize the maximum SNP statistics by their empirical distribution derived from permutation analysis. For gene Gk, denote the observed maximum statistic as mk(D) and the permuted mk(π). We will consider new gene scores defined as follows based on maximum SNP statistics:

where Φ is the standard normal cumulative distribution function. CHI2MEAN and CHIMEAN adjust for gene sizes by scaling and CHI2 and CHI by standardization, all based on the distribution of permuted per-SNP test scores. The square root function used in CHIMEAN and CHI was motivated by experiments using various transformations to obtain more comparable gene-score distributions.

Gene scores based on other multilocus test statistics

We consider two multilocus statistics. One is Hotelling's T2-test, as implemented in the PLINK software.7 The other is proposed by Zhou et al8 to summarize SNP associations within a gene. We name the two gene scores as ‘T2’ and ‘LCMT’. For both methods, the P-value for a gene is calculated first and then mapped to a χ2 distribution with one degree of freedom or the square root of the distribution. The mapped statistic is used as the gene score in the VSEA test.

Gene scores based on tagging SNPs

In this class of methods, we bypass the use of genes as analysis units and instead use a set of representative SNPs selected from these genes. Success of such tests depends on how to properly choose the SNPs. Wang's gene score is a special case that selects single SNPs with maximum statistics. At the other extreme, we may include all SNPs from all genes in the gene set (termed as ‘ALL_SNP’). Between these extremes, we considered three other SNP-set-based methods.

TAG: Tag SNPs are selected as representatives using the method by Meng et al9 and Lin and Altman.10

PCA1: For each gene, principal component analysis (PCA) is first performed on allele count correlation matrix. The eigenvectors for the first few components are used to obtain linear combinations of the SNP association test statistics and form the ‘pseudo-SNP’ statistic, which is further used in ‘SNP-set’ enrichment analysis:

Where λk is the kth largest eigenvalue and ek is the corresponding eigenvector.

PCA2: Similar to ‘PCA1’, this method uses another definition for ‘pseudo-SNP’ statistic:

where operator o stands for the element-wise product of two vector/matrix of the same size. When there are SNPs in complete LD within the gene, both PCA methods will remove extra copies of repeated SNP statistics.

Implementation of the new algorithm

The gene scores presented above are implemented in the following procedure to perform VSEA in GWAS.

  1. 1)

    Perform GWAS by calculating single SNP and/or multilocus association test scores. PLINK7 is called to generate the test statistics for both original and permuted data sets.

  2. 2)

    Calculate normalized gene scores; for SNP-set-based scoring algorithms, this step is replaced by selecting representative SNPs.

  3. 3)

    Calculate enrichment score following the original GSEA algorithm for each gene set.

  4. 4)

    Evaluate significance of the enrichment scores by contrasting with that in permuted data.

  5. 5)

    Normalize enrichment scores by scaling (formula 2) or standardization (formula 3) when multiple gene sets are tested, correcting for multiple testing by FDR/FWER.

The procedure is implemented in an R package called VSEA with utility tools for data formatting and annotation of results. It uses PLINK for computational intensive calculations, in which more flexible modeling is available (for example, tests with environmental covariates included).

Simulation study

Simulation study is used to evaluate the performance of the proposed new gene scores and to compare two approaches with normalizing enrichment scores.

We simulated 10 disease models, all with the same prevalence of 25%, and six interacting disease loci. Disease SNPs contribute to the binary trait through interactions, which were specified through penetrances of joint genotypes at the six disease loci. As shown in Table 1, the minor allele frequencies (MAFs) of the six loci are 5%, 10%, 20%, 30%, 40% and 50%, respectively. The 10 scenarios are divided into two groups for lower (12%) and higher (20%) values of heritability. Each of the two groups is comprised of five scenarios: (S1) no marginal effect is observed for any of the six loci; (S2-1) two loci with low MAF have moderate marginal effects (OR=1.2 in recessive model) and the other four have no marginal effects. (S2-2) Similar to 2 but the two loci with moderate marginal effects have high MAFs; (S3-1) and (S3-2) are similar to (S2-1) and (S2-2), but two loci have strong marginal effects (OR=2 in recessive model). For each of the 10 scenarios, we considered 10 joint-penetrance tables randomly generated to satisfy the marginal constraints. Case–control data sets were then simulated using the SNAP software.11

Table 1 Ratio of marginal genotypic effects at each individual locus in the 10 scenarios of simulation study

Every simulated data set has 300 SNPs consisting of 30 blocks, each with 10 SNPs. SNPs are in high LD within blocks but LD-free across blocks. The six disease loci locate in the first six blocks, respectively. Their positions in corresponding blocks are totally random. We used 750 cases and 750 controls for all 10 scenarios. As we will see later, from the results using 750 cases and 750 controls, scenarios S2-1C1, S2-1C2, S2-2C2 and S2-2C2 tend to have lower power, thus additional sample sizes were used for these scenarios (1500:1500 and 3000:3000) to see how power would be improved; likewise, smaller samples were also used for scenarios that have higher power (250:250 for S3-1C1, S3-1C2, S3-2C2 and S3-2C2).

To define a gene in the simulated data, we selected a random streak of SNPs around each disease locus from the corresponding block. The six disease genes form a gene set in the VSEA test. Further, we defined a reference gene set consisting of irrelevant SNPs similarly, by randomly selecting six blocks that do not have disease loci, and took one locus from each of them to locate a ‘gene’ around it. Power and false-positive rates were calculated based on 100 replicates for each penetrance tables. Further, results were averaged over the 10-penetrance table configurations used in each scenario.

Before calculating gene scores, single SNP tests were performed using the 1 d.f. allelic test implemented by PLINK. In general, the power of VSEA is also dependent on the choice of appropriate single SNP test methods despite the gene score method that is used.

Real GWAS data analysis

To evaluate its practical use, we also applied VSEA to a pilot GWAS data set that characterizes the cardiovascular structural and functional manifestations of hypertension, collectively termed hypertensive heart disease (HHD).

Curation of pathway-based gene sets

To facilitate real data analysis, we constructed biologically interesting pathways from publicly available databases, including KEGG,12 GO13 and BioCarta. We retrieved 179 pathways from KEGG (release 44.0) and 313 pathways from BioCarta (November 2007). For pathways in GO, we constructed gene sets on the basis of GO level 4 annotations for biological process and molecular function. Some nodes in GO level 4 were excluded if they occured in levels 2 and 3 as well. For nodes in level 5 and onward, their genes were assigned to their ancestral GO annotations in level 4. Construction of these GO gene sets was carried out using the goTools package and the GO database from BioConductor 2.1, which used the 200708 release of GO. We obtained 2150 gene sets from GO database. The exact number of gene sets in analysis will differ depending on the genotyping platform for each GWAS data set. Detailed breakdown of gene sets constructed from KEGG, GO and BioCarta for some Affymetrix SNP (Affymetrix Inc., Santa Clara, CA, USA) array platforms and our HHD pilot GWAS data are presented in Table 2. In the real data analysis, we only tested gene sets that contain at least three but no more than 200 genes represented by markers in a given GWA data set, to alleviate multiple testing and to avoid testing overly narrow or broad functional categories.

Table 2 Numbers of gene sets constructed from KEGG, BioCarta and GO for various genotyping platforms

Example GWAS data set

To illustrate the utility of VSEA, we applied the method to a pilot data set of 74 cases and 70 controls from a GWAS study of HHD. Hypertension affects millions of people and HHD is associated with elevated cardiovascular morbidity and mortality.14 Several ongoing GWAS studies attempt to characterize the genetic components of hypertension and related diseases.15 In the pilot HHD sample, SNPs were genotyped using the Affymetrix Mapping 500K Array Set. The data of the SNPs underwent quality control (QC) using commonly accepted criteria on array quality (missing rate ≤0.05, mean heterozygosity between 0.25 and 0.3) and on marker quality (call rate ≥0.99 for SNPs with MAF ≤0.05, call rate ≥0.95 for all other SNPs and Hardy–Weinberg test P-value >10−6). A total of 389 344 SNPs passed QC, which were mapped to genes based on annotation provided by Affymetrix. There were 15 863 genes associated with a total of 320 747 distinct SNPs available for VSEA analysis. We then applied VSEA to this data set using different gene score methods and 1000 permutations, and analyzed empirical distributions of statistics and compared results between gene score methods.

Results

Simulation study

We first checked empirical false-positive rates of the proposed VSEA tests. As shown is Figure 1, the mean false-positive rate fluctuates around the nominal P-values for all scenarios considered by the simulation study. This indicates that the algorithms proposed for calculating a summary gene score, all performed reasonably well and their corresponding VSEA tests are comparable with each other in terms of type-I error rates.

Figure 1
figure 1

Mean false-positive rate of VSEA, estimated by averaging over 100 replicates of simulated data sets, under each of the scenarios considered by the simulation study. Sample sizes are shown at the end of each scenario name.

However, the power of the VSEA test can differ substantially depending on the gene score method used. Among the algorithms based on maximum SNP statistics, most of our proposed methods (except for ‘ABSZ’ and ‘ABSZ2’), including ‘CHI’, ‘CHI2’, ‘CHIMEAN’ and ‘CHI2MEAN’, performed better or similar to ‘WANG’ under all 10 scenarios considered with various sample sizes. The all-time-best performer is ‘CHI2’, which was consistently more powerful than others. As shown in Figure 2a, at a significance level of α=0.05 (nominal P-value), average improvement in power ranged between 14% to as much as 40% depending on the underlying model and the sample size.

Figure 2
figure 2

Estimated power of VSEA test by simulation study, using gene scores based on (a) maximum SNP statistics; (b) multilocus analysis; and (c) SNP-set enrichment. Sample sizes are shown at the end of each scenario name. The left panels use 750 cases and 750 controls, whereas the right panel uses other sample sizes (1500:1500, 3000:3000 and 250:250, identified by labels of the x-axis).

Using multilocus gene scores does improve VSEA test power in some scenarios (Figure 2b). LCMT usually has comparable or greater power compared with Wang's maximum statistics, except in S3-2C1 and S3-2C2 using 750:750 samples, in which there are extremely large single SNP effects. The other multilocus method, Hotelling's T2, has slightly better power than Wang's maximum statistics in most cases, except when some risk SNPs have relatively large marginal effects (S3-1, S3-2).

VSEA tests using gene scores based on most of the tagging SNP-set-based methods performed at least as good as Wang's GSEA (see Figure 2c). The ‘ALL_SNP’ method seemed to be one of the best performers. However, even the best of these methods was still not as powerful as the ‘CHI2’ method.

When there was no marginal effect in any of the six disease SNPs (S1C1 and S1C2), all VSEA methods seemed to have no power at all (power close to nominal P-value). When there were even weak marginal effects in two of the six disease SNPs (S2-1C1, S2-1C2, S2-2C1 and S2-2C2), VSEA methods started to gain power and became useful with studies of larger sample sizes. For example, at α=0.05, VSEA test using the ‘WANG’ method had average power from 0.128 to 0.154 for the four configurations with a sample size of 750:750; the power increased to 0.205–0.264 for a sample size of 1500:1500 and to 0.373–0.566 for a sample size of 3000:3000. VSEA test using the more powerful ‘CHI2’ method achieved a power of 0.172–0.286 with a sample size of 750:750, to 0.345–0.555 for a sample size of 1500:1500 and to 0.694–0.919 for a sample size of 3000:3000. When there were strong marginal effects in two of the six disease SNPs, VSEA tests using all the proposed gene-score-calculating methods yielded good results. ‘CHI2’, ‘CHI’, ‘CHI2MEAN’ and sometimes ‘WANG’ had powers close to 1 with 750 cases and 750 controls. In these scenarios, power of VSEA remained high even when we drop the sample sizes to 250 cases and 250 controls. In particular, ‘CHI2’ had an average power of 0.99 for configurations S3-2C1 and S3-2C2, and around 0.68 for S3-1C1 and S3-1C2.

Real GWAS data

As the gene score method ‘CHI2’ outperformed others in the simulation study, we only report below results using ‘CHI2’ and those using ‘WANG’ for comparison. In the real data analysis, the PLINK permutation test (step-1 of the VSEA algorithm) took 37 min on a single thread on a 2.4 GHz Dual-Core AMD Opteron Processor 2216 (PerformanceWare 1475, Pogo Linux, Inc., Redmond, WA, USA). After that, steps 2–4 of the VSEA analysis using ‘WANG’ or ‘CHI2’ gene score method each took 14 h.

Nominal P-values from calculation using ‘CHI2’ and ‘WANG’ gene scores have a rank correlation of 0.65. In Table 3, we list the top 10 gene sets with smallest P-values by VSEA analyses using the ‘CHI2’ gene score method. For comparison, we also included ranks of the gene sets by the ‘WANG’ method. A similar table based on WANG is in Supplementary Table S1. Although there was some overlap between the two methods, a majority of the top 10 gene sets selected by the two methods were different. This could be related to the small sample size of the example data set (no gene set reached a significance level of 0.05 after adjustment for multiple testing). Therefore, biological interpretation of these results may be limited by the small sample size of this example data set.

Table 3 Top gene sets identified by VSEA test (using CHI2 gene score) in the HHD pilot data

Nevertheless, the real data on genome-wide SNPs in a large number of genes and gene sets allowed us to examine the empirical null distributions of the proposed gene score and enrichment score statistics. This helped us to confirm that the objective of using the proposed statistics was accomplished; that is, normalization of gene scores had made the resulting statistics more comparable in spite of different gene sizes. Some results are shown in Figures 3, in which QQ plots of gene scores by various methods for 100 randomly selected pairs of genes are displayed. Each pair comprises of a gene with just a few (from 3 to 5) SNPs and another with many SNPs (≥100). In the QQ plots, gene scores were calculated in the permuted data sets based on the HHD pilot GWAS data set, and the quantiles of the gene scores in permutation tests for one gene were plotted against that for the other gene in the pair. They reflect the (dis)agreement between null distributions between large (y-axis) and small (x-axis) genes. Therefore, methods that produce the plots tightly centered around the diagonals are more superior in terms of generating comparable gene scores.

Figure 3
figure 3

Comparison of gene-score distributions between large and small genes show effects of different gene-score normalization methods. Each panel presents overlaid QQ plots of gene-score distributions for 100 randomly selected pairs of genes in the example GWAS data set. Each pair comprises of a small gene with just a few (from 3 to 5) SNPs and a large one with at least 100 SNPs. The empirical distribution of gene scores was generated by calculating in the permuted data sets. The quantiles of the two distributions corresponding to each gene of the pair were then plotted against each other. In the plots, x-axis represents the smaller gene in the pair and y-axis represents the larger gene. Each panel is for a specific gene-score method: (a) WANG, (b) CHI2, (c) CHI, (d) ABSZ, (e) CHI2MEAN (see text for details of each method).

As seen in Figure 3, using the simple maximum statistic gene score without normalization, large genes tend to have much greater gene scores than smaller ones. On the other hand, using ‘CHI2’, ‘CHI’ or ‘ABSZ’ methods made distributions much more comparable between large and small genes, although ‘CHI2MEAN’ seemed to bias against large genes.

Similarly, comparisons were made between two normalized enrichment scores (formulae 2 and 3) to examine their effect in deriving comparable enrichment scores even when the sizes (number of genes) of gene sets may vary. In Figure 4, QQ plots are displayed to compare the empirical distributions of normalized enrichment scores for 100 randomly selected pairs of gene sets, using the ‘WANG’ (top row) and ‘CHI2’ gene scores (bottom row), respectively. For each gene-score method, three strategies of treating gene-set enrichment scores were compared: unnormalized (left), normalized by scaling (middle) and normalized by standardization (right). The plots reflect the (dis)agreement between null distributions for different gene sets. Methods that generated the plots that were centered more tightly around the diagonals resulted in normalized scores that are less sensitive to gene-set sizes, and therefore are more suitable for multiple testing corrections. It is clear in Figure 4 that regardless of the gene score method used, normalization helped in deriving more comparable enrichment scores, and the standardization approach used by Wang is better than the simple scaling approach used in the original GSEA.

Figure 4
figure 4

Comparison of enrichment scores when different enrichment score normalization methods were used. Each panel presents overlaid QQ plots of distributions of enrichment scores calculated in the example GWAS data, using 100 randomly selected pairs of gene sets. The empirical distribution of enrichment scores was estimated using 1000 permuted data sets. For each pair of gene sets, the quantiles of the two distributions corresponding to each set of the pair were plotted against each other. Columns of the figure correspond to three approaches to normalization and the rows to methods used for calculating gene scores: top row (ac) using the ‘WANG’ method and bottom row (df) using the ‘CHI2’ method.

Discussion

We presented new statistics for summarizing single SNP association test results in genes and gene sets for VSEA to evaluate enrichment of disease association in predefined gene sets. Similar to the original GSEA, VSEA tests whether the members of a gene set tend to co-occur near the top of the gene list ranked by single SNP analysis. We make several observations based on our evaluation of VSEA, using simulation and real GWAS data sets.

Gene sets as the basic unit for testing genotype–phenotype association

Traditional candidate gene studies consider single variants or haplotypes as basic units of analysis and generally had low replication of findings.16, 17, 18 Because of differences in LD structure across populations, tagging SNPs or haplotypes of the risk variants could be quite different. This is particularly relevant under the common-disease–common-variant hypothesis, in which differential recombination histories across populations could have occurred during the long time before a mutation became prevalent. On the other hand, if there are multiple rare variants that are fairly recent in origin, allelic/locus heterogeneity is more likely across populations. In both cases, using single variants as basic units presents a challenge to successful replication studies.

In contrast, the positions, sequences and functionality of genes are highly consistent across diverse human populations. Furthermore, in complex diseases, groups of genes tend to work together and it seems reasonable to take gene sets as the unit for association analysis. This also has the added value of known genetic pathways and biological processes related to the disease of interest. This idea was successfully demonstrated in gene expression studies.3, 4, 5 As gene expression (transcribed mRNA) levels connect genetic variations to clinically observed disease phenotypes, the value of GESA-type test in GWAS could be inferred. In fact, studies of so-called expression quantitative trait loci19, 20, 21 convincingly demonstrated that genetic variations, together with environmental stimuli, may influence the location, timing and/or level of gene transcription. The findings support the abundance of cis-regulatory variations key to phenotypic variations in humans,22 and motivate the continuing development of gene-set-based methods for GWAS studies.

Other gene-set-based methods available for GWAS

Several newly published methods also use the gene-set approach.23, 24, 25, 26 For example, Chasman23 studied a GSEA-type method by pooling all SNPs in genes of a gene set to form a new set of SNPs with a hypergeometric test for enrichment. It is a special case of the ‘tag-SNPs-set-based’ VSEA, without accounting for the effect of varying gene sizes and gene-set sizes. They showed that the gene-set-based method was generally more efficient than conventional single-variant-based method when there are many variants with small effects. Peng et al25 proposed a procedure similar to that of ours by testing for single SNP associations first, followed by gene-wise and then pathway-based analysis. However, the method was based solely on P-values with an underlying assumption about the independence of single SNP tests. In contrast, VSEA is based on a permutation procedure that properly accommodates the dependence of tests among single SNPs and those among genes. Therefore, Peng's method provides a quick way of reusing published P-values without genotype data, and VSEA is suitable for in-depth GWAS analysis when genotype data are available.

In general, enrichment test of gene-sets can be constructed using a broad range of statistics for gene scores, followed by restandardization of the gene scores, using permutation of phenotypes or genes or both.27 The permutation procedure in GSEA-type method is known as ‘phenotype shuffling’. A different kind of permutation is ‘gene shuffling’, in which gene scores are only calculated once, and then the enrichment score for a given gene set is calculated by comparison with randomly selected gene sets of the same size. Thus, gene shuffling can be much faster. However, the two permutation strategies have different underlying null hypotheses.28, 29 Moreover, whereas phenotype shuffling preserves important correlation structures between genes and among SNPs and is biologically more meaningful, gene shuffling risks destroying them. Therefore, the latter is not recommended unless the two strategies are combined to form a possibly more powerful test.27 It remains to be seen whether this can materialize in GWAS because shuffling both phenotypes and genes simultaneously require significantly more computation.

The VSEA method is aimed at improving the ‘accuracy’ of gene scores so that they are less dependent on the size of genes and gene sets in consideration. Our study is not without its own limitations. For example, the simulation involved only 300 SNPs and a limited number of genes because of the lack of a realistic GWAS simulator that can handle complex interaction models. The method also did not explicitly model or directly test for interaction effects. Instead, it relies on predefined gene sets to capture a large collection of weak marginal effects, a common drawback of existing gene-set-based methods. Nevertheless, novel and improved GSEA methods continue being developed30 in gene expression analysis. For GWAS studies, the persisting problem of ‘missing heritability’31 demands also for similar methods to detect collective actions of many risk factors. Therefore, further investigation is warranted for robust gene-set- and pathway-based methods that can more effectively incorporate biological information and be capable of providing functional understanding of new findings about the disease of interest.