Introduction

The continued ability to digest lactose after weaning varies among humans and is particularly common among populations that have traditionally practiced cattle herding. The capacity for continued lactose digestion, or lactase persistence (LP), is conferred by a few mutations in a cis-acting control element of the LCT gene, which encodes the lactase-phlorizin hydrolase enzyme (LPH).1, 2, 3 It has been shown that particular variants at SNPs in the introns of an adjacent gene (MCM6) prevent the downregulation of LPH in adults.4, 5, 6, 7 One of these SNPs (−13910C-T or rs4988235) has likely been under strong selection in some European populations.8, 9, 10 Northern Europeans in particular show high frequencies of this mutation together with high levels of LP. The frequency of this SNP-variant and the ability of adults to digest lactose decrease towards southern Europe and the Middle East and are low in North Africa.3, 6, 11 This particular mutation is at very low frequency or absent in sub-Saharan African populations, even though some groups, such as East African pastoralists, show a high prevalence of LP.12

Subsequent candidate gene studies showed that a different polymorphism (−14010G-C), located 100 bp downstream from the SNP-variant that causes LP in Northern Europeans and occurring on a different haplotypic background, was strongly linked to LP in various East African groups, and that there was a strong signal for selection in some of these populations.7 The frequency of this variant varies among East African groups: it occurs at frequencies of 39 and 32% in Nilo-Saharans from Tanzania and Kenya and at frequencies of 46 and 18% in Afro-Asiatic groups from these two countries. The variant is less frequent in the Sandawe (13%) and absent in the Hadza hunter-gatherers from East Africa, as well as in various Sudanese populations.7 However, patterns of genome-wide genetic variation and linkage disequilibrium in East African populations remain poorly studied, and the signature of recent selection seen around the LCT locus in Nilo-Saharans and Afro-Asiatic groups has not yet been compared with other parts of the genome in these populations. West African farmers (such as the Yoruba from Nigeria), central and southern African hunter-gatherers, and East African Bantu-speaking groups show no signature of selection at the LCT locus.10, 13, 14 Both of these LP polymorphisms (European −13910C-T and East African −14010G-C) have been directly linked to enhanced transcription of the LCT gene by means of binding affinity and reporter gene assays.15, 16, 17 In addition to the two polymorphisms mentioned above, three other polymorphisms within the adjacent MCM6 gene have also been linked to the lactase persistence trait in specific groups of people. The compound −13712C, −13915G allele has a role in LP in the Middle East,4 while the −22018G-A SNP is linked to the trait in certain northern European populations5 and the −13907C-G SNP is linked to the trait in some Sudanese populations.7 Although the function of MCM6 is unrelated to the LCT gene function and the LP trait, it contains two of the regulatory regions for LCT, located in two of the MCM6 introns, 14 kb (most of the LP polymorphisms) and 22 kb (the −22018G-A variant) upstream of the LCT gene (a summary can be found at http://omim.org/entry/601806).

In this study, we performed genome-wide scans for recent positive selection in the HapMap Maasai population,18 including the region on chromosome 2q21 where the LCT and MCM6 genes are located, and compared the results with those from other HapMap populations. We found that signatures of recent selection at the LCT/MCM6 gene-region are the strongest across the genome in the Maasai population. Furthermore, the signals of recent positive selection around the LCT gene are stronger in the Maasai than in the CEU population, which could reflect stronger selection pressure in the Maasai, more recent selection in the Maasai, or different demographic histories of the Maasai and the CEU.

Materials and methods

We obtained phased genotype data comprising 1 387 465 autosomal SNPs from HapMap III18 for 204 individuals from 7 HapMap populations: CEU, TSI, MKK, LWK, YRI, JPT, and CHB (downloaded 30 November 2010 from ftp://ftp.ncbi.nlm.nih.gov/hapmap/phasing/2009-02_phaseIII/HapMap3_r2/). We used chimpanzee alleles from panTro219 in an alignment with the human genome20 to determine the ancestral SNP-variant. A genetic map was retrieved from the 1000 Genomes data (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20110217_M broad_omni_genotypes/). We retained a final set of 466 614 SNPs that had both a genetic map position and an ancestral SNP-variant inferred from the chimpanzee genome.

Integrated haplotype score (iHS) values13 were calculated with the software iHS (http://hgdp.uchicago.edu/Software/) for the MKK, CEU, TSI, LWK and YRI groups. For a window-based measure, ‘winiHS’, we calculated the mean of the absolute value of iHS in windows of 30 consecutive SNPs, with a jump length of 15 SNPs between windows.
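As a rough illustration of the window-based measure described above, the following Python sketch averages the absolute iHS values in windows of 30 consecutive SNPs that advance 15 SNPs at a time. The function name `window_mean_abs` is hypothetical, and dropping a trailing window that contains fewer than 30 SNPs is an assumption; the text does not say how incomplete windows were handled.

```python
def window_mean_abs(values, window=30, step=15):
    """Mean of |value| in windows of `window` consecutive SNPs.

    `values` holds one per-SNP statistic (e.g. iHS) in genomic order.
    Windows start every `step` SNPs. Assumption: a trailing window with
    fewer than `window` SNPs is dropped.
    """
    results = []
    for start in range(0, len(values) - window + 1, step):
        chunk = values[start:start + window]
        results.append(sum(abs(v) for v in chunk) / window)
    return results


# Toy input: 60 SNPs give windows starting at SNP 0, 15 and 30.
toy_ihs = [1.0, -1.0] * 30
toy_win_ihs = window_mean_abs(toy_ihs)
```

With these defaults, consecutive windows overlap by half their length, so each SNP away from the ends contributes to two window values.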

The population branch statistic (PBS) was computed according to Yi et al.21 For each branch in the unrooted population topology connecting MKK, CEU and HapMap East Asians (JPT+CHB), the PBS statistic was computed using an allele frequency based estimate of FST.22 The average PBS in windows of 30 SNPs and step-length 15 SNPs was calculated based on the same SNPs as in the iHS-analysis, and standardized by subtracting the mean and dividing by the SD (of the window values) resulting in a statistic that we refer to as ‘window-based measure of PBS’ (winPBS).
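The PBS calculation can be sketched in Python as follows, assuming the formulation of Yi et al., in which each pairwise FST is converted to a branch length T = −log(1 − FST) and the branch leading to population A is (T_AB + T_AC − T_BC)/2. Here A would correspond to MKK (or CEU) and B and C to the other two groups in the topology. The pairwise FST values are taken as inputs because the specific estimator is not detailed in this section; the function names and the use of the population SD in the standardization are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def branch_length(fst):
    """Transform a pairwise FST (must be < 1) into T = -log(1 - FST)."""
    return -math.log(1.0 - fst)


def pbs(fst_ab, fst_ac, fst_bc):
    """Population branch statistic for population A in the topology (A, B, C)."""
    t_ab = branch_length(fst_ab)
    t_ac = branch_length(fst_ac)
    t_bc = branch_length(fst_bc)
    return (t_ab + t_ac - t_bc) / 2.0


def standardize(window_values):
    """Subtract the mean and divide by the SD of the window values.

    Assumption: the population SD is used; the values must not all be equal.
    """
    mu = mean(window_values)
    sd = pstdev(window_values)
    return [(v - mu) / sd for v in window_values]


# Toy example: A is strongly differentiated from B and C, which are identical.
toy_pbs = pbs(0.5, 0.5, 0.0)
toy_z = standardize([1.0, 2.0, 3.0])
```

A large PBS for A indicates that allele frequencies changed mostly on the branch leading to A, which is why the statistic can single out population-specific differentiation.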

A Median Joining Network (with Maximum Parsimony post-processing) was constructed using Network v.4.6.0.023, 24 for a 100 kb region encompassing the MCM6 gene, which contained 60 SNPs in the HapMap III data.

Results

We scanned the genomes of five HapMap populations (MKK – Maasai, Nilo-Saharan speakers from Kenya in East Africa (n=87); CEU – western European ancestry (n=17); TSI – Tuscan from Italy (n=88); YRI – Yoruba, Niger-Kordofanian speakers from Nigeria in West Africa (n=9); and LWK – Luhya, Bantu-speakers from Kenya in East Africa (n=90)) for regions of extended haplotype homozygosity using iHS,13 which can detect selective sweeps that have not yet reached fixation in a population. Subsequently, we calculated a window-based statistic (winiHS) as explained in the ‘Materials and Methods’ section. The choice of window size had little impact on the result (Supplementary Figure S1).

Across the entire genome for the MKK, the strongest winiHS signal was found in the chromosomal region where the LCT gene is located (Figure 1a). Furthermore, the 16 strongest genome-wide winiHS signals were all confined to this region. The signal in this region was absent in all populations except for MKK and CEU (Figure 1b, Supplementary Figure S2, Supplementary Table S1). The strongest signal in the TSI, the YRI and the LWK was found at the MHC region (at 30 Mb on chromosome 6, 6p22.1), but the MKK and the CEU also showed strong signals in this region (Supplementary Figure S2, Supplementary Table S1). The top 20 winiHS peaks in all screened populations are shown in Supplementary Table S1. The signal in the LCT/MCM6 region extends only slightly further in MKK (4.16 Mb) than in CEU (3.3 Mb), but the peak winiHS value in MKK was about twice as high as in CEU (Figure 1b). While sample size differences between populations are a potential concern, the signal in MKK remained stronger than in CEU when MKK was down-sampled to 34 haploid genomes (the same number as for CEU) (Supplementary Figure S3).
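One simple way to equalize sample sizes before repeating a scan is to draw haplotypes at random without replacement, sketched here in Python. This is an assumption about the procedure: the text does not state how the 34 haploid genomes were chosen (for example, whether whole individuals or single haplotypes were sampled). The function name, the fixed seed and the placeholder haplotype labels are hypothetical.

```python
import random


def downsample_haplotypes(haplotypes, n_keep=34, seed=0):
    """Return `n_keep` haplotypes drawn at random without replacement.

    `haplotypes` is a list with one entry per haploid genome. A fixed seed
    makes the draw repeatable. Raises ValueError if fewer than `n_keep`
    haplotypes are available.
    """
    rng = random.Random(seed)
    return rng.sample(haplotypes, n_keep)


# Toy example: 87 diploid individuals give 174 haploid genomes.
toy_haplotypes = [f"hap_{i}" for i in range(174)]
toy_subset = downsample_haplotypes(toy_haplotypes)
```

Repeating the draw with several seeds would show how much the resulting statistic depends on which haplotypes happen to be retained.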

Figure 1

Selection scans. (a) winiHS across the genome (see text) of the MKK population. Light blue corresponds to odd-numbered chromosomes while even-numbered chromosomes are colored dark blue. The LCT/MCM6 region on chromosome 2 and the MHC regions on chromosome 6 are marked by horizontal lines. (b) Close-up of winiHS in the lactase region (position 125 Mb to 145 Mb on chromosome 2). The numbers correspond to the rank across the entire genome of each population of the winiHS for the SNPs (the top 20 SNPs are shown). The gray vertical line marks the region of LCT/MCM6. (c) Close-up of the winPBS in the lactase region (position 125 Mb to 145 Mb on chromosome 2). The numbers correspond to the rank across the entire genome (the top 10 SNPs are shown).

To study the possible impact of recent positive selection in the East African and European populations using an alternative approach, we employed a method based on searching for unusually differentiated genomic regions using the PBS.21 A winPBS-value was computed as explained in the ‘Materials and Methods’ section. This statistic revealed unusually high differentiation of the MKK and CEU around the LCT region (Figure 1c), with greater peak winPBS values in the MKK than in the CEU. The winPBS value in this region was the third strongest across the entire genome in the MKK sample.

Subsequently, we focused on a 100 kb region encompassing the MCM6 gene, which contained 60 SNPs in the HapMap 3 data. A direct comparison of the haplotypes in the CEU subset and the MKK subset indicated that each of these populations contained one specific high-frequency haplotype, and that these two haplotypes differed substantially from each other. To visualize related haplotypes, we constructed a Median Joining Network23, 24 (Figure 2). The European LP variant (−13910C-T at rs4988235) coincided with the most frequent CEU haplotype. The SNP associated with LP in East Africa7 was not present in our filtered data set or in the complete HapMap 3 data, but we identified a haplotype that putatively contains the East African LP-causing variant. This haplotype is the most common haplotype (65.5%) in the Maasai (the second most common haplotype had a frequency of 7.5% and the frequencies of the remaining 27 haplotypes were all below 3%). Owing to the exceptionally strong signal for selection that we observe, it is unlikely that any of these lower-frequency haplotypes underlies the LP trait in the Maasai group. Furthermore, Tishkoff et al7 found the LP trait to be at frequencies of 71% and 59% in the Kenyan and Tanzanian Maasai, respectively (the frequency of the suggested HapMap Maasai LP haplotype is intermediate between these frequencies), and identified the −14010G-C mutation in 58% and 44.7% of the two respective groups. Finally, the genome of one of the HapMap Maasai individuals carrying the putative Maasai LP haplotype (NA21733) has been sequenced by Complete Genomics (http://www.completegenomics.com/). For the 100 kb region encompassing the MCM6 gene, this individual is homozygous for the most frequent Maasai haplotype (based on the HapMap data); at the East African LP SNP, the individual carries one copy of the LP variant (−14010C) and one copy of the non-LP variant (−14010G). For these reasons, it is likely that the East African −14010C LP-causing variant very often occurs on this high-frequency haplotype background in the Maasai. The other four identified LP SNPs were either absent or at very low frequencies in the two Maasai groups studied by Tishkoff et al.7
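Haplotype frequencies such as those reported above can be tabulated by counting identical multi-SNP haplotypes. The Python sketch below assumes each haplotype is represented as a string of alleles across the SNPs in the region; the function name and the three-SNP toy haplotypes are hypothetical and are not taken from the HapMap data.

```python
from collections import Counter


def haplotype_frequencies(haplotypes):
    """Return (haplotype, frequency) pairs, most frequent first.

    `haplotypes` is a list with one allele string per haploid genome.
    """
    counts = Counter(haplotypes)
    total = sum(counts.values())
    return [(hap, n / total) for hap, n in counts.most_common()]


# Toy example with three-SNP haplotypes from four haploid genomes.
toy_region = ["AAG", "AAG", "ACG", "AAG"]
toy_table = haplotype_frequencies(toy_region)
```

The first entry of the returned list is the most common haplotype, which is the quantity compared between populations in the text.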

Figure 2

Haplotype network and frequency table for 60-SNP haplotypes encompassing the LCT and MCM6 loci. The network shows the relationship of the 60-SNP haplotypes in five selected HapMap3 populations. The CEU LP haplotype is indicated in the figure, as defined by the 13910C-T mutation (rs4988235). The inset-table shows the frequencies of the CEU LP haplotype, the MKK LP haplotype and other haplotypes in five different HapMap populations. The average PBS for the 60 SNPs had a greater value for MKK (0.57) than for CEU (0.43), where the genome average PBS values were 0.071 for MKK and 0.036 for CEU.

The East African haplotype putatively associated with LP in the Maasai also occurred at high frequency (31.3%) in the HapMap Tuscans (TSI), where it was about three times as common as the northwestern European LP haplotype (10.2%, Figure 2).

Discussion

The LP phenotype confers a great advantage on individuals who live in pastoralist societies, as it allows access to a new sustenance niche that would otherwise have been inaccessible. In addition, milk is a more sustainable food source than meat, because animals need not be culled to obtain it. The HapMap Maasai population from Kenya is an East African pastoralist population that relies heavily on milk consumption as a food source, in addition to meat and blood.25, 26 Although meat is considered an important food source among the Maasai, it is consumed infrequently because personal wealth is measured in terms of cattle. Against such a subsistence and cultural background, the acquisition of LP is expected to be highly advantageous.

In this study, we found the strongest genome-wide signal for selection at the LCT/MCM6 region in the HapMap Maasai using iHS selection scans. The only other HapMap population that showed a signal for selection in this region was the CEU group. Two different statistics that detect selection, iHS and PBS, indicated a stronger signal in the East African Maasai group than in the European CEU group. The stronger iHS signal might indicate stronger selection pressure in the Maasai, but it could also indicate more recent selection in the Maasai than in the CEU group, or more efficient selection in the Maasai due to less genetic drift (larger Ne in Maasai). Indeed, Tishkoff et al7 estimated a younger age (2700–6800 years) for the East African −14010C allele than for the European −13910T allele (8000–9000 years), although the time estimates had large overlapping confidence intervals. Furthermore, other factors such as different demographic histories of the two groups (i.e., differences in effective population sizes and migration rates from neighboring populations) and the influence of ascertainment bias might also have a role, but the signal of selection is nevertheless stronger in the Maasai than in the CEU.

The haplotype network illustrated the two different haplotype backgrounds for the European and putative East African LP-causing variants, as was found by Tishkoff et al.7 The putative East African Maasai LP haplotype is at lower frequencies in the two other African groups (Yoruba and Luhya), and only one Northern European CEU individual carried this haplotype (Figure 2). The Tuscan group showed a higher frequency (31.3%) for this haplotype, though the iHS scan of the TSI did not show any signal of selection at the LCT locus (Supplementary Figure S1). However, LP has been shown to be present in 39.5% of Italians8, 11, 27 and the CEU LP haplotype was only present at 10.2% in Tuscans, which suggests that other polymorphisms might also be involved in the LP phenotype for this population. While the frequency of the putative East African LP haplotype is high in the Tuscans, without a direct survey of the −14010C LP-causing variant in Tuscans, we can only speculate about the potential LP-causing variant(s) in that population.

To conclude, our study documents a strong impact of recent positive selection on haplotype structure, variation, and differentiation associated with LP in the East African Maasai, and the genome-wide selection signal is greater than for the well-studied case of LP in Northwestern Europe.