Variance component model to account for sample structure in genome-wide association studies

Abstract

Although genome-wide association studies (GWASs) have identified numerous loci associated with complex traits, imprecise modeling of the genetic relatedness within study samples may cause substantial inflation of test statistics and possibly spurious associations. Variance component approaches, such as efficient mixed-model association (EMMA), can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution; but such approaches are computationally impractical. We report here a variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours. We apply this method to two human GWAS data sets, performing association analysis for ten quantitative traits from the Northern Finland Birth Cohort and seven common diseases from the Wellcome Trust Case Control Consortium. We find that EMMAX outperforms both principal component analysis and genomic control in correcting for sample structure.
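The core idea of a variance component association test can be illustrated with a short sketch. The code below is not the EMMAX implementation; it is a minimal example that assumes a marker-based kinship matrix built from standardized genotypes and assumes the two variance components (`sigma_g2`, `sigma_e2`) have already been estimated elsewhere, for instance once under a null model. The function names and parameters are hypothetical, chosen for illustration only.

```python
import numpy as np

def kinship_from_genotypes(G):
    """Marker-based relatedness from an (individuals x markers) matrix of 0/1/2 allele counts.

    Assumption: a simple standardized-genotype estimator, K = Z Z^T / m.
    """
    G = np.asarray(G, dtype=float)
    mean = G.mean(axis=0)
    std = G.std(axis=0)
    keep = std > 0  # drop monomorphic markers, which carry no information
    Z = (G[:, keep] - mean[keep]) / std[keep]
    return Z @ Z.T / Z.shape[1]

def gls_association(y, snp, K, sigma_g2, sigma_e2):
    """Generalized least squares test of one SNP under V = sigma_g2*K + sigma_e2*I.

    The variance components are taken as given (assumption); estimating them
    is outside the scope of this sketch. Requires sigma_e2 > 0.
    Returns the SNP effect estimate and its t-like statistic.
    """
    y = np.asarray(y, dtype=float)
    snp = np.asarray(snp, dtype=float)
    n = len(y)
    V = sigma_g2 * K + sigma_e2 * np.eye(n)
    L = np.linalg.cholesky(V)                 # V = L L^T
    X = np.column_stack([np.ones(n), snp])    # intercept + SNP
    Xw = np.linalg.solve(L, X)                # whiten design matrix
    yw = np.linalg.solve(L, y)                # whiten phenotype
    beta, _, _, _ = np.linalg.lstsq(Xw, yw, rcond=None)
    resid = yw - Xw @ beta
    s2 = resid @ resid / (n - X.shape[1])
    cov = s2 * np.linalg.inv(Xw.T @ Xw)
    return beta[1], beta[1] / np.sqrt(cov[1, 1])
```

Whitening the phenotype and the genotype with the Cholesky factor of the covariance matrix reduces each per-SNP test to ordinary least squares. In this sketch the factorization is repeated on every call for clarity; when the variance components are held fixed across markers, it could be computed once and reused.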

Figure 1: Scatter plots of the first two principal components against latitude and longitude.
Figure 2: The genomic control parameters for ten traits change with the number of principal components used for adjustment.
Figure 3: Comparison of P value distributions across different methods with NFBC66 data.
Figure 4: Rank concordance comparison of strongly associated SNPs between different methods.
Figure 5: Distribution of the marker-specific inflation factors from NFBC66 data sets.
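For readers unfamiliar with the genomic control parameter referred to in Figure 2, the following is a minimal sketch of the standard genome-wide inflation factor: the median of the observed 1-degree-of-freedom chi-square statistics divided by the median expected under the null (approximately 0.4549). The exact calculation used in the paper is not shown in this preview, so the function below is an illustrative assumption, and its name is hypothetical.

```python
import numpy as np

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_control_lambda(z_scores):
    """Inflation factor from per-marker z-scores (chi-square = z**2)."""
    chi2 = np.asarray(z_scores, dtype=float) ** 2
    return float(np.median(chi2) / CHI2_1DF_MEDIAN)
```

A value near 1 indicates test statistics that follow the null distribution over most of the genome; values well above 1 suggest inflation, for example from uncorrected sample structure.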

Acknowledgements

We thank the NFBC66 team for access to phenotype and genotype data used in the analyses presented here. The genotype data were generated at the Broad Institute with support from National Heart, Lung, and Blood Institute grant 6R01HL087679-03. We thank D. Clayton for reading through the manuscript and for providing important suggestions. We acknowledge the WTCCC for allowing us to use their data set. H.M.K., N.A.Z., J.H.S. and E.E. are supported by National Science Foundation grants 0513612, 0731455 and 0729049, and National Institutes of Health (NIH) grants 1K25HL080079 and U01-DA024417. N.A.Z. is supported by the Microsoft Research Fellowship. H.M.K. is supported by the Samsung Scholarship, National Human Genome Research Institute grant HG00521401, National Institute for Mental Health grant NH084698 and GlaxoSmithKline. C.S. is partially supported by NIH grants GM053275-14, HL087679-01, P30 1MH083268, 5PL1NS062410-03, 5UL1DE019580-03 and 5RL1MH083268-03. N.B.F. and S.K.S. are supported by NIH grants HL087679-03, 5PL1NS062410-03, 5UL1DE019580-03 and 5RL1MH083268-03. This research was supported in part by the University of California, Los Angeles subcontract of contract N01-ES-45530 from the National Toxicology Program and National Institute of Environmental Health Sciences to Perlegen Sciences.

Author information

Contributions

H.M.K., J.H.S., C.S. and E.E. designed the methods and experiments; H.M.K., J.H.S., S.K.S., S.-y.K., N.B.F., C.S. and E.E. jointly analyzed the NFBC66 data set; H.M.K., J.H.S., N.A.Z., C.S. and E.E. jointly analyzed the WTCCC data set; H.M.K., J.H.S., S.K.S., N.B.F., C.S. and E.E. wrote the manuscript; all authors contributed their critical reviews of the manuscript during its preparation.

Corresponding authors

Correspondence to Chiara Sabatti or Eleazar Eskin.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Tables 1–3, Supplementary Figures 1–6 and Supplementary Note (PDF 2666 kb)

About this article

Cite this article

Kang, H., Sul, J., Service, S. et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet 42, 348–354 (2010). https://doi.org/10.1038/ng.548

  • DOI: https://doi.org/10.1038/ng.548
