Population-specific genetic variation in large sequencing data sets: why more data is still better

van Rooij, Jeroen G J; Jhamai, Mila; Arp, Pascal P; Nouwens, Stephan C A; Verkerk, Marijn; Hofman, Albert; Ikram, M Arfan; Verkerk, Annemieke J; van Meurs, Joyce B J; Rivadeneira, Fernando; Uitterlinden, André G; Kraaij, Robert

doi:10.1038/ejhg.2017.110

Published: 19 July 2017

short report

Jeroen G J van Rooij¹,²,
Mila Jhamai¹,
Pascal P Arp¹,
Stephan C A Nouwens¹,
Marijn Verkerk¹ (ORCID: orcid.org/0000-0002-7476-3700),
Albert Hofman³,⁴,
M Arfan Ikram²,³ (ORCID: orcid.org/0000-0003-0372-8585),
Annemieke J Verkerk¹,
Joyce B J van Meurs¹,
Fernando Rivadeneira¹ (ORCID: orcid.org/0000-0001-9435-9441),
André G Uitterlinden¹,² &
Robert Kraaij¹

European Journal of Human Genetics volume 25, pages 1173–1175 (2017)


Abstract

We have generated a next-generation whole-exome sequencing data set of 2628 participants of the population-based Rotterdam Study cohort, comprising 669 737 single-nucleotide variants and 24 019 short insertions and deletions. Because of the broad and deep longitudinal phenotyping of the Rotterdam Study, this data set permits extensive interpretation of genetic variants across a range of clinically relevant outcomes, and is accessible as a control data set. We show that next-generation sequencing data sets yield a large number of population-specific variants that are not captured by other available large sequencing efforts, namely ExAC, ESP, 1000G, UK10K, GoNL and DECODE.

Introduction

In the era of next-generation sequencing (NGS), the use of large population data sets to approximate variant frequencies in control populations has become common practice. The first large population-scale sequencing data set was generated by the 1000 Genomes Project,¹ where an integrated genome-wide map of genetic variation was established for 2504 individuals of European, American, African and Asian descent. Another approach was taken by the NHLBI ‘Grand Opportunity’ Exome Sequencing Project, in which a set of 6500 European American and African American samples was exome sequenced.² The recent Exome Aggregation Consortium (ExAC) is now combining exome sequencing data sets from over 60 000 unrelated individuals from different origins.³ From these large sequencing projects, it became apparent that many variants are population-specific.³ Therefore, several initiatives have generated more local data sets. The UK10K project⁴ contains 4000 genomes from the UK, along with 6000 exomes from individuals with selected extreme phenotypes. A collection of 3000 Finnish exomes showed that the Finnish population had more loss-of-function variants and gene knock-outs than non-Finnish Europeans.⁵ GoNL,⁶ the Dutch reference genome project, provided a local genetic map based on whole-genome sequencing of 250 Dutch trios.⁷ Another local data set is based on full genomes from 2636 Icelanders.⁸ Because Iceland is an isolated population, deleterious variants could reach higher frequencies there than in other populations. These initiatives emphasize the importance of local genetic maps for interpreting the clinical relevance of a potential disease-causing mutation, and indicate the differences between available population data sets that should be considered when these are used in research or clinical practice.

Within the Rotterdam Study cohort, a prospective population-based cohort study on individuals 45 years and older to investigate determinants of disease and disability in the Dutch population,⁹ we have generated a set of 2628 exomes for integrative genetic studies of diverse phenotypes and to serve as a local reference panel for clinical sequencing efforts.

Materials and methods

DNA samples were obtained from the Rotterdam Study, which is a prospective population-based cohort study established in 1990 studying the determinants of disease and disability in Dutch elderly individuals.⁹ Out of 5984 eligible participants from the RS-I cohort (based on the availability of height, weight, GWAS data and informed consent), 3284 subjects were randomly selected, as shown in Figure 1. Baseline characteristics are provided in Supplementary Table 1.

Genomic DNA was prepared from whole blood and processed using the Illumina TruSeq DNA Library preparation (Illumina, Inc., San Diego, CA, USA), followed by exome capture using the Nimblegen SeqCap EZ V2 kit (Roche Nimblegen, Inc., Madison, WI, USA). Paired-end 2 × 100 bp sequencing was performed at six samples per lane on Illumina HiSeq2000 sequencer using Illumina TruSeq V3 chemistry.

Reads were demultiplexed and aligned to the human reference genome hg19 (UCSC, Genome Reference Consortium GRCh37) using the Burrows-Wheeler alignment tool (BWA version 0.7.3a¹⁰). After indel realignment and base quality score recalibration using the Genome Analysis ToolKit (GATK version 2.7.4¹¹) and masking of duplicates (Picard Tools version 1.90¹²), gVCF files were generated using HaplotypeCaller v3.1.1 (GATK) and genotyped using GenotypeGVCFs v3.1.1 (GATK).¹¹ Raw genotype data were quality-controlled and filtered as described in the Supplementary Information. All coding variants used in the analysis are available in the European Variation Archive (http://www.ebi.ac.uk/eva/) under accession number PRJEB20726.

All detected variants were annotated based on RefSeq annotation (NCBI Reference Sequence Database) using ANNOVAR (version 2014-07-14¹³). The presence and allele frequencies of these variants in six databases, namely 1000G (v3),¹ ESP (v2),² ExAC (v0.3),³ UK10K (v1407),⁴ DECODE (v1501)⁸ and the Genome of the Netherlands (v4),⁶ were obtained and compared to our data set.
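The database comparison described above can be illustrated with a minimal sketch. It is not the annotation workflow used in this study (which relied on ANNOVAR); it only assumes that each data set can be reduced to a set of variants keyed by chromosome, position, reference allele and alternate allele. The function name `overlap_summary`, the database labels and the toy variants are hypothetical.

```python
from typing import Dict, Set, Tuple

# A variant is keyed by (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def overlap_summary(own: Set[Variant], databases: Dict[str, Set[Variant]]):
    """Return per-database overlap fractions and the variants absent from every database."""
    if not own:
        raise ValueError("own variant set is empty")
    # Fraction of our variants that are also present in each database.
    overlap = {name: len(own & variants) / len(own) for name, variants in databases.items()}
    # Union of all external databases, then the variants seen nowhere else.
    seen_elsewhere: Set[Variant] = set().union(*databases.values())
    novel = own - seen_elsewhere
    return overlap, novel


# Toy data (hypothetical, for illustration only).
own_set = {
    ("1", 1000, "A", "G"),
    ("1", 2000, "C", "T"),
    ("2", 3000, "G", "A"),
    ("3", 4000, "T", "C"),
}
public = {
    "db_large": {("1", 1000, "A", "G"), ("1", 2000, "C", "T"), ("9", 1, "A", "C")},
    "db_local": {("1", 2000, "C", "T"), ("2", 3000, "G", "A")},
}

overlap, novel = overlap_summary(own_set, public)
print(overlap)
print(sorted(novel))
```

Matching on exact position and alleles is a simplification: in practice, indels in particular may need normalisation before variants from different call sets can be compared.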

Results

In total, 2628 samples passed technical and genetic quality control and were included in the data set (Figure 1), with a mean depth of coverage of 55x (range 20x to 185x; median 53x). A total of 669 737 single-nucleotide variants (SNVs) and 24 019 short insertions or deletions (indels) were detected; this data set was denoted Rotterdam Study Exome Sequencing set 2 (RSX2). Of all 669 737 SNVs detected in our RSX2 data set, 439 633 (66%) were exonic. Of these, 120 677 (27.4%) were not detected in any other public database (ExAC2.0, ESP6500, 1000G, UK10K, DECODE and GoNL), as shown in Figure 2. Most of these variants (120 179; 99.6%) were found at a minor allele frequency (MAF) below 1% in our data set; 65 324 were singletons (54%) and 19 870 were doubletons (17%). The largest overlap with a single data set was with ExAC2.0 (71% of 439 633 SNVs), followed in descending order by ESP6500 (46%), 1000G (36%), UK10K (34%), GoNL (26%) and DECODE (22%).
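The frequency classes used above can be made concrete with a short worked example. The sketch assumes diploid autosomal sites at which all 2628 samples have a called genotype, so that the MAF is the minor allele count divided by 5256 alleles; the function names are illustrative and not taken from the study's pipeline.

```python
N_SAMPLES = 2628           # samples in RSX2 (from the text)
N_ALLELES = 2 * N_SAMPLES  # assumption: diploid sites, no missing genotypes


def maf(minor_allele_count: int, n_alleles: int = N_ALLELES) -> float:
    """Minor allele frequency as allele count over total alleles."""
    return minor_allele_count / n_alleles


def frequency_class(minor_allele_count: int) -> str:
    """Classify a variant by its minor allele count in the data set."""
    if minor_allele_count == 1:
        return "singleton"
    if minor_allele_count == 2:
        return "doubleton"
    return "MAF < 1%" if maf(minor_allele_count) < 0.01 else "MAF >= 1%"


# Largest minor allele count that still gives a MAF below 1% in 5256 alleles.
max_rare_count = max(c for c in range(1, N_ALLELES) if maf(c) < 0.01)
print(max_rare_count, frequency_class(max_rare_count), frequency_class(max_rare_count + 1))

# Percentages recomputed from the counts reported in the Results.
total_snvs, exonic, absent, absent_rare = 669_737, 439_633, 120_677, 120_179
pct_exonic = round(100 * exonic / total_snvs)           # exonic share of all SNVs
pct_absent = round(100 * absent / exonic, 1)            # exonic SNVs absent elsewhere
pct_absent_rare = round(100 * absent_rare / absent, 1)  # of those, MAF < 1%
print(pct_exonic, pct_absent, pct_absent_rare)
```

Under these assumptions a singleton corresponds to a MAF of 1/5256, roughly 0.02%, which shows how far below the 1% threshold most of the variants absent from other databases lie.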

Discussion

Of the 439 633 detected coding variants, 120 677 were absent from all six other population databases. A portion of this absence can be attributed to various biological (ie, ethnic backgrounds, isolated populations or case-series) and technical (whole-genome sequencing, exome capturing or filtering strategies and sequencing depth) differences; the remainder is most likely due to population-specific variation.

The smallest overlap, with DECODE, is partly due to the lower sequencing depth and stronger filtering strategy in that data set, resulting in fewer variants in general. In addition, the genetically isolated status of the Icelandic population implies less genetic variability and a smaller overlap with RSX2.⁸ Despite originating from a similar population, the small overlap with the GoNL database is likely due to its small sample size, which reduces the power to detect rare variants.⁶ A larger overlap with UK10K was observed as a result of its large sample size and related population. The differences with the UK10K data set are largely due to population-specific differences and the selection of individuals with extreme phenotypes in UK10K.⁴ The 1000G data set holds many more variants than RSX2, probably because whole-genome sequencing covers coding regions inaccessible to whole-exome sequencing, and because of the presence of non-Caucasian individuals.¹ Similarly, differences in populations and sample size make the ESP6500 data set larger than RSX2, although the selection for various case-populations might also be of influence.² Finally, the largest data set, ExAC2.0, contains the most variants, as a result of its much larger sample size and the inclusion of many different populations.³

Each data set present in this comparison contained variants not present in any of the other data sets. These results suggest that, for example, when filtering or interpreting genetic variants in a WES analysis of a Mendelian disease pedigree, both smaller population-specific data sets (such as RSX2, GoNL, UK10K and/or DECODE) and large aggregation data sets (such as ExAC) contribute information and should be used jointly to filter. Because each database contributes variants not seen elsewhere, as many eligible databases as possible should be considered in these types of analyses. When WES data sets are to be used as controls (eg, in a case-control comparison), it should be noted that some data sets, such as UK10K, ESP and ExAC2.0, contain large collections of case-series²,³,⁴ and will not provide a good representation of DNA sequence variants across the allele frequency spectrum in the normal population. Given their design and collection strategy, population-based data sets such as RSX2, DECODE and GoNL might be better suited for this purpose, depending on the diseases and traits studied and their estimated prevalence in these databases.
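As a minimal sketch of the joint filtering suggested above, a candidate variant can be retained only if its highest allele frequency across all consulted databases stays below a chosen threshold. The threshold of 0.1%, the variant identifiers and the frequencies below are hypothetical, and the database names are used only as labels; a real analysis would choose the threshold according to the inheritance model and disease prevalence.

```python
from typing import Dict, List, Optional

# Allele frequency per database for one variant; None means "not observed there".
Frequencies = Dict[str, Optional[float]]


def max_population_frequency(freqs: Frequencies) -> float:
    """Highest frequency seen in any database; 0.0 if the variant is seen in none."""
    observed = [f for f in freqs.values() if f is not None]
    return max(observed, default=0.0)


def filter_candidates(candidates: Dict[str, Frequencies], max_af: float = 0.001) -> List[str]:
    """Keep variants whose frequency is at most max_af in every database consulted."""
    return [
        variant_id
        for variant_id, freqs in candidates.items()
        if max_population_frequency(freqs) <= max_af
    ]


# Hypothetical candidates from a pedigree analysis.
candidates = {
    # Absent from the large aggregate but common in two local data sets: removed.
    "var_a": {"ExAC": None, "GoNL": 0.02, "RSX2": 0.015},
    # Not observed anywhere: kept.
    "var_b": {"ExAC": None, "GoNL": None, "RSX2": None},
    # Very rare in the aggregate, unseen locally: kept.
    "var_c": {"ExAC": 0.0001, "GoNL": None, "RSX2": None},
    # Common in the aggregate: removed.
    "var_d": {"ExAC": 0.05, "GoNL": None, "RSX2": None},
}

kept = filter_candidates(candidates)
print(kept)
```

The first toy variant illustrates the point made in the text: filtering against a large aggregate alone would have retained it, whereas the local data sets show it to be too common to be a plausible cause of a rare Mendelian disease.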

References

1. 1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM et al: A map of human genome variation from population-scale sequencing. Nature 2010; 467: 1061–1073.
2. Tennessen JA, Bigham AW, O'Connor TD, Fu W, Kenny EE, Gravel S et al: Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 2012; 337: 64–69.
3. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T et al: Analysis of protein-coding genetic variation in 60,706 humans. Nature 2016; 536: 285–291.
4. UK10K WTSI, Hinxton, UK. Available at: http://www.uk10k.org [June 2015].
5. Lim ET, Wurtz P, Havulinna AS, Palta P, Tukiainen T, Rehnstrom K et al: Distribution and medical impact of loss-of-function variants in the Finnish founder population. PLoS Genet 2014; 10: e1004494.
6. Boomsma DI, Wijmenga C, Slagboom EP, Swertz MA, Karssen LC, Abdellaoui A et al: The Genome of the Netherlands: design, and project goals. Eur J Hum Genet 2014; 22: 221–227.
7. Genome of the Netherlands Consortium: Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat Genet 2014; 46: 818–825.
8. Gudbjartsson DF, Helgason H, Gudjonsson SA, Zink F, Oddson A, Gylfason A et al: Large-scale whole-genome sequencing of the Icelandic population. Nat Genet 2015; 47: 435–444.
9. Hofman A, Brusselle GG, Darwish Murad S, van Duijn CM, Franco OH, Goedegebure A et al: The Rotterdam Study: 2016 objectives and design update. Eur J Epidemiol 2015; 30: 661–708.
10. Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 2010; 26: 589–595.
11. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A et al: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010; 20: 1297–1303.
12. Picard Tools. Available at: http://broadinstitute.github.io/picard/.
13. Wang K, Li M, Hakonarson H: ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010; 38: e164.

Author information

Authors and Affiliations

Department of Internal Medicine, Erasmus MC, Rotterdam, Netherlands
Jeroen G J van Rooij, Mila Jhamai, Pascal P Arp, Stephan C A Nouwens, Marijn Verkerk, Annemieke J Verkerk, Joyce B J van Meurs, Fernando Rivadeneira, André G Uitterlinden & Robert Kraaij
Department of Neurology, Erasmus MC, Rotterdam, Netherlands
Jeroen G J van Rooij, M Arfan Ikram & André G Uitterlinden
Department of Epidemiology, Erasmus MC, Rotterdam, Netherlands
Albert Hofman & M Arfan Ikram
Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Albert Hofman

Corresponding author

Correspondence to Robert Kraaij.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

Additional information

A Supplementary video accompanies this paper on the European Journal of Human Genetics website

Supplementary information

Supplementary Methods (DOCX 26 kb)

Supplementary Figure 1 (JPG 149 kb)

Supplementary Figure 2 (JPG 132 kb)

Supplementary Table 1 (DOCX 16 kb)

Supplementary Information (DOCX 12 kb)

Supplementary Movie (MP4 73693 kb)

About this article

Cite this article

van Rooij, J., Jhamai, M., Arp, P. et al. Population-specific genetic variation in large sequencing data sets: why more data is still better. Eur J Hum Genet 25, 1173–1175 (2017). https://doi.org/10.1038/ejhg.2017.110

Received: 11 July 2016
Revised: 25 April 2017
Accepted: 13 June 2017
Published: 19 July 2017
Issue Date: October 2017
DOI: https://doi.org/10.1038/ejhg.2017.110
