SNP and haplotype variation in the human genome

https://doi.org/10.1016/S0027-5107(03)00014-9

Abstract

We have surveyed and summarized several aspects of DNA variability among humans. The variation described is the result of mutation followed by a combination of drift, migration and selection bringing the frequencies high enough to be observed. This paper describes what we have learned about how DNA variability differs among genes and populations. We sequenced functional regions of a set of 3950 genes. DNA was sampled from 82 unrelated humans: 20 African-Americans, 20 East Asians, 21 Caucasians, 18 Hispanic-Latinos and 3 Native Americans. Different aspects of variability showed a great deal of concordance. In particular, we studied patterns of single nucleotide polymorphism (SNP) allele and haplotype sharing among the four large sample populations. We also examined how linkage disequilibrium (LD) between SNPs relates to physical distance in the different populations. It is clear from our findings that while many variants are common to all populations, many others have a more restricted distribution. Research that attempts to find genetic variants that explain phenotypic variants must be careful in its choice of study population.

Introduction

With the advent of the first draft of the human genome sequence [1], [2], research into the genetic causes of phenotypic differences among humans has been brought to a new scale. To plan such research, it is vital to understand the characteristics of the genetic variation that exists. Genaissance Pharmaceuticals has undertaken a broad and deep survey of that variation. Our interests are directed primarily towards explaining differential drug response among patients, yet the patterns we have uncovered may lend insight into other areas of study. Stephens et al. [3] detailed results from surveying 313 genes. Schneider et al. [4] extended that analysis to over 2000 genes. In this paper, we update these analyses based on the study of 3950 genes and their single nucleotide polymorphisms (SNPs) and haplotypes.

Given the enormity of the human genome, one must usually place filters on which areas to investigate. One logical starting point is to restrict analysis to known and putative genes. We have taken that approach. Beyond eliminating extragenic regions, we have also focused on the likely functional regions of each gene. Specifically, we approach SNP and haplotype discovery by sequencing exons, 100 bp of introns on each side of each exon, and up to 1 kb upstream and 100 bp downstream of each gene. In this way, functionally important variants in coding, intron–exon splice junction, and proximal promoter regions are likely to be discovered [5].
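The targeting rule above can be illustrated with a minimal sketch. It assumes a gene on the forward strand, half-open coordinates, and a fixed 1 kb upstream window (the text says "up to 1 kb"); the function name `target_regions`, its parameters, and the example coordinates are hypothetical and are not taken from the study's pipeline.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), forward-strand coordinates


def target_regions(
    exons: List[Interval],
    gene_start: int,
    gene_end: int,
    intron_flank: int = 100,   # bp of intron sequenced on each side of an exon
    upstream: int = 1000,      # assumed fixed; the text says "up to 1 kb"
    downstream: int = 100,     # bp sequenced past the end of the gene
) -> List[Interval]:
    """Return the merged intervals that would be sequenced for one gene."""
    regions = [
        (gene_start - upstream, gene_start),   # proximal promoter window
        (gene_end, gene_end + downstream),     # downstream window
    ]
    for start, end in exons:
        regions.append((start - intron_flank, end + intron_flank))

    # Clip at the start of the chromosome, then merge overlapping intervals.
    regions = sorted((max(0, s), e) for s, e in regions)
    merged = [regions[0]]
    for start, end in regions[1:]:
        last_start, last_end = merged[-1]
        if start <= last_end:
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical three-exon gene spanning positions 5000 to 9000.
example_exons = [(5000, 5200), (6000, 6150), (8800, 9000)]
print(target_regions(example_exons, gene_start=5000, gene_end=9000))
# [(4000, 5300), (5900, 6250), (8700, 9100)]
```

In this example the flank of the first exon is absorbed by the upstream window, and the flank of the last exon coincides with the downstream window, so three contiguous targets remain.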

As the results reported below show, it is crucial in studying genetic variation to sample subjects of diverse ethnogeography. Chromosomes sampled from different populations typically have meaningful and unpredictable differences. With this understanding in mind, we sampled from 82 unrelated humans with the following ethnogeographic self-identifications: 21 Caucasians (CA), 20 African-Americans (AF), 20 Asians (AS), 18 Hispanic-Latinos (HL), and 3 Native Americans (NA). Most of these subjects described their parents and grandparents as having the same background. In addition to these individuals, we also included a three-generation European-American (CEPH) family (four grandparents, two parents, four offspring) and a two-generation African-American family (two parents, five offspring). The eldest generation of each family contributed to the unrelated totals above. All sequencing, genotyping, SNP, and haplotype results will refer to this sample of individuals unless otherwise stated.

The genes in this study represent all manner of biochemical and physiological function. Because Genaissance focuses on discovering genetic predictors of drug response, the gene list is weighted toward that purpose. The genes include drug targets and proteins involved in disease, metabolism, absorption, excretion, and transport. Additional genes have been processed because of functional, animal model, or evolutionary considerations. Furthermore, knowledge of the exon–intron structure of such genes was an important criterion in prioritizing the genes we sequenced for SNP and haplotype discovery. Despite the method of prioritization, these genes, which represent roughly 10% of the currently estimated number of human genes, seem a good set for studying patterns of human genetic variation.

SNPs

Single nucleotide polymorphisms are an atomic form of genetic variation [6]. They can be discovered through shotgun sequencing, or, as we have done, through resequencing of targeted genomic regions. After discovery, SNPs may be assembled into haplotypes as discussed in Section 3 to examine allelic variation at a larger scale, such as the level of the whole gene. SNPs can be measured easily through many “genotyping” laboratory technologies.

SNPs are of interest for a variety of reasons. First, a SNP,

Haplotypes

A haplotype is simply the set of polymorphism alleles that co-occur on a chromosome. We have estimated haplotypes statistically for the SNPs in each gene using the program HAP™ Builder [15].
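To make the definition concrete, the sketch below counts distinct haplotypes in a small set of chromosomes and computes the LD measure r² between two SNPs. It assumes phase is already known and that each SNP is biallelic; it does not reproduce the statistical inference performed by HAP™ Builder, and the function names and sample data are hypothetical.

```python
from collections import Counter
from typing import List


def haplotype_counts(chromosomes: List[str]) -> Counter:
    """Count how often each haplotype (string of SNP alleles) occurs."""
    return Counter(chromosomes)


def r_squared(chromosomes: List[str], i: int, j: int) -> float:
    """r^2 between the biallelic SNPs at string positions i and j."""
    n = len(chromosomes)
    # Use the alleles on the first chromosome as the reference alleles;
    # r^2 does not depend on which allele is chosen at either SNP.
    a, b = chromosomes[0][i], chromosomes[0][j]
    p_a = sum(h[i] == a for h in chromosomes) / n
    p_b = sum(h[j] == b for h in chromosomes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in chromosomes) / n
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom


# Hypothetical phased data: six chromosomes typed at three SNPs.
sample = ["ACG", "ACG", "GTG", "GTA", "ACG", "GTA"]
print(haplotype_counts(sample))   # three distinct haplotypes
print(r_squared(sample, 0, 1))    # SNPs 0 and 1 always co-occur
print(r_squared(sample, 0, 2))    # SNPs 0 and 2 are only partly associated
```

In this toy sample the first two SNPs are in complete LD (r² = 1), so they define only two allele combinations, while the third SNP splits one of those combinations and yields the third haplotype.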

The patterns of haplotype diversity and sharing in and among populations closely parallel those described for SNPs. Furthermore, the number of haplotypes for a gene is strongly correlated with the number of SNPs. Fig. 4 depicts this relationship. In the absence of recombination, gene conversion, and

Mutation and selection

Beyond looking at overall diversity characteristics of these polymorphisms, we can also learn about mutation and selection by studying differences between classes of change. Fig. 6 shows the counts of each of the 12 classes of base changes where the change is specified as common allele/rare allele. Two obvious trends are observed. First, transitions greatly outnumber transversions. Second, the G and C alleles tend to be the major alleles and A and T the minor ones, by a ratio as great as 2:1
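The 12 classes arise because each of the four bases can serve as the common allele with any of the three other bases as the rare allele; four of these ordered pairs are transitions (purine to purine or pyrimidine to pyrimidine) and eight are transversions. A minimal sketch of such a tally follows. The function names and the SNP list are hypothetical and are not the data behind Fig. 6.

```python
from collections import Counter
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def change_class(common: str, rare: str) -> str:
    """Classify a common/rare allele pair as a transition or a transversion."""
    bases = PURINES | PYRIMIDINES
    if common not in bases or rare not in bases or common == rare:
        raise ValueError(f"not a valid base change: {common}/{rare}")
    pair = {common, rare}
    same_chemical_group = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_chemical_group else "transversion"


def tally(snps: Iterable[Tuple[str, str]]) -> Tuple[Counter, Counter]:
    """Count SNPs by ordered change (common/rare) and by class."""
    snps = list(snps)
    by_change = Counter(f"{common}/{rare}" for common, rare in snps)
    by_class = Counter(change_class(common, rare) for common, rare in snps)
    return by_change, by_class


# Hypothetical SNPs written as (common allele, rare allele).
snps = [("G", "A"), ("C", "T"), ("G", "A"), ("A", "G"), ("C", "A"), ("G", "T")]
by_change, by_class = tally(snps)
print(by_change["G/A"])            # 2
print(by_class["transition"])      # 4
print(by_class["transversion"])    # 2
```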

Conclusions

A tremendous amount of variability exists in the human genome, even within the functional regions of genes. Researchers who intend to correlate specific phenotypes (e.g. disease susceptibility or variable drug response) to genomic variation must be aware of how this variation is organized and distributed among genes, gene regions, and populations. We have now surveyed these patterns in at least 10% of all known human genes. We can generalize that while recombination plays a role in generating

References (23)

  • M.K. Halushka et al., Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis, Nat. Genet. (1999)