Introduction

Although Genome-Wide Association Studies (GWAS) have been successful in identifying associations between common variants and complex traits and diseases, much of the heritability of these traits and diseases remains unexplained.1 Recently, there has been a deepening interest in evaluating the extent to which rare variants contribute to variation in complex traits and diseases.2, 3, 4, 5, 6, 7, 8, 9 This has motivated development of statistical methods for testing rare-variant associations at the gene level.10, 11, 12, 13, 14, 15, 16 Although these methods are useful for increasing statistical power to detect associations relative to single-variant analyses, valid well-powered statistical analyses are contingent on careful examination of phenotypes and underlying assumptions. Here we explore some commonly encountered issues with how phenotypes are distributed and how these issues affect inferences from rare-variant association tests.

One assumption underlying many rare-variant association studies is that rare variants exert larger effect sizes than common variants. For some traits, this hypothesis is borne out by genetic evidence. For instance, the LDLR gene harbors multiple rare variants that are strongly associated with circulating low-density lipoprotein (LDL)-cholesterol levels.7 The genetic effects of these rare variants are so strong that individuals carrying certain LDLR mutations appear as outliers in population level summaries of LDL-cholesterol levels.7 Rare-variant association studies of complex traits are particularly interested in phenotypic outliers because they may harbor rare variants with strong genetic effects.

Furthermore, many rare-variant association tests rely on asymptotics that work best with normally (or approximately normally) distributed phenotypes.12, 16 However, many quantitative phenotypes are not normally distributed in healthy populations (even after controlling for confounders that may contribute to non-normality).17, 18 We show that rare-variant association tests are uniquely susceptible to biases caused by outliers and non-normality.

Materials and Methods

Simulations

We considered two different types of trait distributions. To simulate outliers, we generated a mixture of normal random variables by choosing 95% of the values to be drawn from a standard N(0,1) distribution and the other 5% from N(0, σ=8). To simulate non-normal phenotypes that are similar to those observed in genetic studies, we randomly generated highly skewed random variables from the distribution. As a secondary analysis, we simulated phenotypes using a left-skewed Gompertz distribution and a mixture of χ2 distributions by drawing 95% of the trait values from a and 5% from a distribution. Histograms of simulated trait distributions are shown in Supplementary Figure S5. Genotypes were drawn as 0, 1, or 2, from the multinomial distribution, with probabilities derived from Hardy–Weinberg equilibrium with specified minor allele frequencies.

Given our randomly generated genotypes, we also simulated heteroskedasticity (unequal error variance between genotypes) by drawing the ith phenotypic value Yi from N(0,1) if Xi=0, from N(0,1.5) if Xi=1, and from N(0,2) if Xi=2. In this manner, we simulated quantitative traits with no mean shift in trait value between genotypes, but where the genotype predicts the variance of the trait values.

We considered several different approaches for dealing with outliers. Winsorising (WINS) is a technique that limits the influence of extreme values by setting all outliers to a specified percentile of the observed data. We considered a 95% winsorization, where we set all observations below the 5th percentile or above the 95th percentile to the values observed at the 5th and 95th percentile, respectively. We also evaluated deleting outliers (DEL), where all values below the 5th percentile or above the 95th percentile were removed. In comparison to winsorising or deleting outliers, we also obtained empirical P-values by permuting the quantitative trait values (PERMUTE) one million times. In addition, we performed robust regression using the M-estimator (HUBER),19 as implemented in the rlm() package in R (R Foundation for Statistical Computing, Vienna, Austria). Finally, we performed the rank-based inverse normal transformation (INT), where all values of the trait are ranked, and the ranks are mapped to percentiles of the standard normal distributions. Specifically, the transformed value of the phenotype for the ith subject was:

where ri is the rank of the ith observation among a sample of size n, and Ф−1 denotes the standard normal quantile function. We also considered the natural logarithm transformation (LOG-NORM), as well as the Kruskal–Wallis non-parametric test (K–W; implemented with the kruskal.test() function in R, R Foundation for Statistical Computing). Note that while WINS, DEL, LOG-NORM, and INT can be considered trait transformations, PERMUTE, HUBER, and K–W are procedures that do not transform the phenotype values.

For simulations of single-genetic variants, we considered minor allele frequencies of both 0.005 and 0.05. We ran 100 000 iterations for both the Type I error and Power simulations. Type I error and Power were evaluated with simulated sample sizes n=10 000 and n=2000, respectively. For gene-level tests of association, we chose a modestly sized gene (MC4R) and a larger gene (ALK). Genotypes were simulated as previously described in Auer et al,20 with allele frequencies for non-synonymous variants taken from the Exome-Variant Server.

We evaluated single-variant associations using simple linear regression for every method except K–W. Gene-level associations were evaluated using the combined multivariate collapsing (CMC) burden test,12 the burden of rare-variants test20 (BRV, an adaptation of GRANVIL,21) and the sequence kernel association (SKAT)16 variance components test. Due to computational intensity, we did not evaluate the performance of SKAT under permutations. We also did not attempt to generalize the SKAT method with an M-estimator, so we do not report results for the HUBER method with SKAT. Note that SKAT is incompatible with K–W, therefore we did not evaluate its performance. Gene-level tests were implemented using a custom script in R for the CMC and BRV tests, and using the SKAT() function in R.

To evaluate the power, we assessed statistical significance at α=5 × 10−4 (this was the lowest significance level that we could implement across our simulations in a reasonable amount of time). Genetic effects were generated under the following additive genetic models: For single-variant analyses, we simulated the ith phenotypic value as Yi=Xiβ+ɛi, where Xi denotes the randomly generated genotype for the ith observation, β is the effect size and ɛi is the randomly generated error term (either from the mixture of normals for outliers or from a for non-normality).

For gene-level tests, we used a similar model with , where Xij is the randomly generated genotype for the ith observation at the jth variant site, and βj is the effect at the jth variant site. We chose βj as a 0.1*Bernoulli(p) random variable (when simulating fixed effects) or as a 0.1*Multinomial(1,2,3,4,5,6,7,8,9,10)*Bernoulli(p) random variable (when simulating variable effects). For the gene-level simulations, p was set to one of 0.1, 0.25, 0.5, 0.75, or 1, corresponding to the percent of causal variants within the gene.

Data analysis

To evaluate the various approaches for outliers and non-normality, we analyzed Exome-Chip genotypes from the Women’s Health Initiative (WHI).22 Our analyses focused on association testing for both circulating platelet counts (PLT) and white blood cell counts (WBC). These data have already been used in a meta-analysis that reported several robustly replicated rare-variant associations with PLT and WBC.3 Of the 161 808 participants in the WHI who were eligible and consented to genetic research, 18 513 were included in this analysis.

Blood counts were performed with automated hematology cell counters and standardized quality assurance procedures. WBC and PLT were recorded during the WHI baseline examination, conducted during 1993–1998. DNA samples were genotyped using the Illumina HumanExome v1.0 SNP array (Illumina, San Diego, CA, USA). Genotypes were assigned using GenomeStudio v2010.3 (Illumina). Markers with a genotyping success rate of less than 99% were excluded, as were samples with a genotyping success rate of less than 98%. Crytpic relatedness was assessed using the PLINK IBS/IBD functionality.23 For each related or duplicate pair of samples, we excluded the sample with the lower call rate. Samples with WBC >200 (× 109 cells/l) or PLT >1000 (× 109 cells/l) were excluded from the analysis, as these values are biologically implausible in healthy individuals.

Raw values for both PLT and WBC were regressed on age, genotyping batch, and the first two principal components. Neither PLT nor WBC are normally distributed. WBC is severely right skewed, and there are outliers for both PLT and WBC (Supplementary Figure S8), making them excellent phenotypes for illustrative purposes. The residuals from these regressions were either transformed (using INT, WINS, DEL, or LOG-NORM), and the transformed values were used for association testing, or the raw residuals were used for association testing with the K–W, PERMUTE, or HUBER approaches.

For testing single-variant associations, we considered single-nucleotide variants with a minor allele count >2. For gene-level association testing, we considered all missense, nonsense, or splice variants with an observed minor allele frequency ≤1%. Gene-level association testing was conducted with the CMC, BRV, and SKAT methods.

Results

Simulations

When there are outliers in the data, tests for rare-variant associations (both for single variants and for gene-level tests), suffer from inflated Type I error rates unless a correction is applied (Figure 1 and Supplementary Figure S1). We compared WINS, DEL, PERMUTE, INT, HUBER, and K–W to performing linear regression with outliers included (IGNORE). Ignoring outliers leads to inflation in Type I error for single-variant analyses, SKAT, and CMC (Figure 1 and Supplementary Figure S1, Table 1). Each method (WINS, DEL, PERMUTE, INT, HUBER, K–W) effectively controlled Type I error (Figure 1 and Supplementary Figure S1, Table 1).

Figure 1
figure 1

QQplots of rare-variant associations with a quantitative phenotype under the null hypothesis of no association. INT is shown in black, WINS in red, DEL in orange, PERMUTE in purple, K–W in brown, LOG-NORM in green, and HUBER in gray; ignoring outliers or ignoring non-normality is shown in blue. Ignoring outliers leads to inflation of Type I error for single-variant analyses with MAF=0.005, all of the corrections successfully controls Type I error (a). For non-normal phenotypes, ignoring, HUBER, and LOG-NORM lead to modest inflation of Type I error for single-variant analyses with MAF=0.005; all other corrections control Type I error (b). The CMC approach for rare variants in MC4R, does not show inflation of Type I error when outliers or non-normality are ignored (c, d, respectively). The SKAT approach for rare variants in MC4R shows modest inflation of Type I error in the presence of outliers (e) and non-normality (f).

Table 1 Type I error probabilities at significance levels of 5 × 10−2, 5 × 10−3, and 5 × 10−4

For data generated under a non-normal distribution, we compared INT, LOG-NORM, PERMUTE, HUBER, and K–W, to ignoring non-normality. When quantitative traits follow a distinctly non-normal distribution, we observed modest inflation of Type I error for single-variant analyses, SKAT, and CMC when non-normality is ignored. INT, PERMUTE, and K–W uniformly controlled Type I error across simulation settings. LOG-NORM and HUBER were only effective in some circumstances. (Figure 1, Supplementary Figures S1 and S2, Tables 1 and 2). The results were similar when we simulated non-normal trait distributions that also contained outliers (Table 2).

Table 2 Type I error probabilities at significance levels of 5 × 10−2, 5 × 10−3, and 5 × 10−4

Although not the primary aim of our study, we also considered the Type I error control under heteroskedasticity. Similar to the results reported in Beasley et al.24 we found that none of these methods (WINS, DEL, PERMUTE, INT, HUBER, K–W, LOG-NORM) were effective at controlling Type I error when genotype predicts the variance of the trait values (Supplementary Figure S2).

The methods we considered for controlling Type I error in the presence of outliers and non-normality demonstrated varying performance in their power to detect associations. When outliers are present in the data, DEL, and PERMUTE suffer a dramatic loss of power for single-variant analyses with MAF=0.005; HUBER, INT, K–W, and WINS were very similarly powered in this circumstance (Figure 2). The same was true for the CMC, BRV, and SKAT rare-variant tests (Figure 2, Supplementary Figure S6). We simulated different proportions of causal variants, as well as both fixed and random genetic effects for variants within a gene region. Changing these parameters did not affect the primary conclusion: that HUBER, INT, K–W, and WINS were all most powerful in detecting associations in the presence of outliers (Supplementary Figures S3).

Figure 2
figure 2

Power plots of rare-variant associations with a quantitative phenotype. Power is shown on the y axis, INT is shown in black, WINS in red, DEL in orange, PERMUTE in purple, K–W in brown, LOG-NORM in green, and HUBER in gray; ignoring outliers or ignoring non-normality is shown in blue. For single-variant analyses with MAF=0.005 and outliers, permutations and deletion of outliers suffer from a dramatic loss of power (a). Of note PERM displays the lowest power of all methods. When the phenotype violates normality, INT, HUBER, and K–W demonstrate the highest power to detect an association (b). For a and b, effect sizes (ie, beta values) are taken in terms of trait SD’s. For the CMC test in ALK, INT, HUBER, and K–W demonstrate the highest power with phenotypic outliers (c) or non-normal trait values (d). For the SKAT approach in ALK, INT, and WINS had the highest power to detect associations with phenotypic outliers (e); INT had the highest power to detect associations with non-normal trait values (f). For variants in ALK, we ran simulations with 10, 25, 50, 75, and 100% causal variants. All causal variants had the same effect size (ie, beta value) of 0.25 trait SD’s.

For non-normally distributed phenotypes, HUBER, K–W, and INT were most powerful in detecting associations across our simulation settings (Figure 2, Supplementary Figures S3 and S4, S6 and S7). Similar to the results for outliers, LOG-NORM and PERMUTE suffered a loss in power (Figure 2, Supplementary Figures S3 and S4, S6 and S7). Note that because HUBER and K–W are incompatible with SKAT, INT is the most powerful method for detecting associations using SKAT in the presence of a non-normally distributed phenotype.

Finally, we compared the power of the various approaches when phenotypes were simulated with error terms from the N(0,1) distribution. In this instance, one would expect that running a simple regression and ignoring any outliers or non-normality (IGNORE) would be most powerful. Indeed, we found that for MAF of 0.005 and 0.05 across a range of effect sizes, IGNORE was most powerful (Table 3). In comparison to IGNORE, the INT, PERM, HUBER, and LOG-NORM approaches did not suffer any notable loss of power; K–W, WINS, and DEL all displayed marked loss of power.

Table 3 Power results at significance levels of 5 × 10−4

Data analysis

Both PLT and WBC were analyzed for association at both the variant- and gene-level. For PLT and WBC, INGORE, LOG-NORM, and HUBER were ineffective at controlling Type I error for variants with MAF <5% (Supplementary Figure S9). INT, PERMUTE, and K–W most closely followed the diagonal line on the qqplots. For variants with MAF > 5% it is difficult to visually establish Type I error control from qqplots, because these are both highly polygenic traits with hundreds of underlying common variants. For the gene-level tests (CMC, BRV, and SKAT), INT, DEL, WINS, PERM, and HUBER demonstrated control of Type I error as displayed by qqplots that closely followed the identity line (Supplementary Figure S10).

To evaluate the power of these approaches to detect associations, we considered a number of true positive associations that have been robustly replicated in multiple studies. The CXCR2 gene harbors multiple rare, missense variants that are associated with WBC, show signal in gene-level tests of association, and are represented on the Exome-Chip.3 In addition, we considered three different variants that are each strongly associated with PLT (rs41303899 (TUBB1), rs3184504 (SH2B3), and rs148636776 (SH2B3)).3 As displayed in Table 4, INT shows the strongest signal for association between WBC and a burden of rare variants in CXCR2, although almost all of the methods detect the association at gene-level exome-wide significance (5 × 10−6). For PLT, the P-values appear similar across approaches, with the exception of PERM which shows the weakest signal for association (Table 4). Taken together, these results suggest that INT and WINS effectively control Type I error while picking up on true associations in a large-scale real data analysis of rare-variant associations.

Table 4 P-values from the analysis of WHI Exome-Chip data for rare-variant associations with WBC and PLT

Discussion

Although it has become a common approach for the analysis of GWAS data,25 there are reservations about the impact of INT on the results from association testing.24 Both Beasley et al.24 and Buzkova26 investigated the effect of INT when there is heteroskedasticity and demonstrated that Type I error was not well-controlled. Not surprisingly, they also report that for normally distributed traits, INT is less powerful than using untransformed data. Our results for large sample sizes and rare variants are consistent with these observations. Beasley et al.24 noted, ‘The intricacies of the differences among the power functions of the t-test and the t-test performed on INTs with different sample sizes, effect sizes, and error distributions need further investigation.’

In standard regression analyses, normality is often assessed on the residuals rather than the raw trait values.27 Indeed, covariates with large effects may induce a multi-modal trait distribution, which disappears after adjustment. In our analyses, we adopted the following approach: (1) regress the trait values on the set of covariates, (2) transform the residuals from this regression; and (3) test for association between the transformed residuals and the genetic variable of interest. Although this method suffers from a loss of power when covariates are correlated with the genetic variable of interest,28, 29 in the absence of such pathological correlation, we have found this to be a convenient and flexible approach for genetic association testing.

There are a number of considerations when deciding whether an untransformed phenotype is suitable for rare-variant association testing. As we did for the Exome-Chip analyses of WBC and PLT, outliers should be checked for biological plausibility. QQplots offer a powerful method for assessing normality. After regression of trait values on covariates, the ranked residuals can be plotted against the percentiles of a normal distribution. Although visually detecting deviation from the diagonal line is often sufficient, the Shapiro–Wilk test for normality30 is a more formal approach. Because it is not a strict assumption of these methods, minor deviation from normality can be tolerated. We recommend using a combination of a formal approach (such as the Shapiro–Wilk test) along with visual inspection of qqplots to assess whether the trait is ‘normal enough’ to conduct rare-variant association testing without a transformation (INT, WINS, DEL, and LOG-NORM) or alternate approach to testing (K–W, PERMUTE, and HUBER).

Under a variety of simulations, we found that INT effectively controlled Type I error and was the most powerful method, or very close to the most powerful method. For phenotypic outlier WINS and INT had comparable Type I and Type II error rates. However, phenotypic data may be both non-normal and contain outliers. In these cases, WINS only deals with the outliers, leaving the non-normality issue un-addressed. The INT is the only single approach we investigated that effectively deals with both outliers and non-normality simultaneously. Interestingly, PERMUTE was poorly powered in the presence of outliers or non-normality. Although PERMUTE is a gold standard method for controlling Type I error in genetic association studies, we recommend only using PERMUTE if the trait is approximately normally distributed and contains few, if any, outliers.

Unlike previous investigations,24 we evaluated power and Type I error at low alpha-levels. In addition, because many genome-wide genetic studies are using large sample sizes (even for rare-variant investigations)3, 6 our simulations featured large sample sizes as well. Rather than focusing our attention only on single-variant tests of association, we also investigated how aggregate rare-variant association tests (such as SKAT, CMC, and BRV) behave in the presence of phenotypic outliers or non-normality. For large-scale genome-wide studies for both common and rare variants, we recommend using INT or WINS as an effective means of correcting for trait outliers and INT for addressing non-normality.