Introduction

Bovine milk contains many essential nutrients, such as lactose (~ 48 g/L), fatty acids (~ 37 g/L), proteins (~ 34 g/L), and minerals (~ 9 g/L). Although less abundant than other solid components of milk, the major minerals—potassium (K), calcium (Ca), phosphorus (P), sodium (Na), and magnesium (Mg)—have an important effect on both human health and the cheese-making process. In humans, all of these minerals are necessary for many vital functions and therefore for the maintenance of good health. Dairy products can represent an important source of minerals in the human diet, especially of well-assimilated Ca1. In milk, minerals are found either in solution (soluble fraction) or in colloidal form (insoluble fraction). Some minerals are exclusively found in the soluble fraction (e.g., K and Na) while others exist in both fractions (e.g., Ca, P, and Mg). In the soluble fraction, Ca, P, and Mg exist in different forms, including ions and salts (phosphates and citrates), while in colloidal form, they are associated with casein molecules in the micelles and play a role in the structure and stability of these assemblages during the cheesemaking process2. Higher mineral concentrations are therefore associated with improved coagulation properties of milk3,4 and could enhance the nutritional value for human consumers.

Despite the potential benefits to human nutrition and milk processing, little is known about the genetic factors that influence milk mineral composition, mainly because the determination of mineral content via reference analyses is costly and time-consuming. A number of studies have reported genetic variation in milk mineral composition5,6,7,8,9,10,11, but to our knowledge, only two studies have conducted genome-wide association studies (GWAS) to investigate the genomic regions associated with these traits9,12. Both of those prior studies used 777 k SNPs and a relatively small sample of cows. As an alternative to reference analyses, mid-infrared (MIR) spectrometry can predict various milk components, including mineral fractions13,14,15, quickly and cheaply. Because of these advantages, milk MIR spectra are routinely recorded and stored. The combination of this technology with i) the widespread genotyping of cows for genomic selection, ii) the availability of whole-genome sequence (WGS) data from the 1000 Bull Genomes Project16, and iii) ever-increasing knowledge of the bovine genome17,18 creates the possibility of large-scale, high-resolution analyses for identification of the genes and variants that affect the mineral content of milk. We have previously applied this approach—whole-genome sequence-based GWAS combined with MIR predictions—to investigate milk protein composition19 and cheese-making traits20, and in both cases we succeeded in highlighting functional candidate genes. In particular, the genes SLC37A1 and ANKH were strongly linked with milk quality; both encode transmembrane proteins involved in ion transport and are therefore likely to have an effect on milk mineral composition. However, in these genes, as well as in other genes we identified, the variants with the most significant effects were mostly found in non-coding regions with limited annotation, which made it difficult to distinguish the causal variant. This pattern is quite general and many studies have reported the major role of regulatory variants in the architecture of complex traits21. To address this challenge, further investigation of non-coding regions is needed, particularly with respect to binding sites of transcription factors and non-coding RNA which could regulate gene expression.

The main objective of this study was to identify the best candidate genes and variants that might affect the content of Ca, P, Mg, K, Na, and citrate in milk, as predicted from MIR spectra. For this, we first conducted a GWAS on imputed WGS data of 19,586 Montbéliarde cows, and then performed post-GWAS analyses using different sources of annotation data to further refine our results.

Results

We analyzed six traits predicted from MIR spectra in Montbéliarde cows, representing the mineral (Ca, P, Mg, K, and Na) and citrate content of milk. MIR prediction equations originated from the Optimir project14,15. The accuracies of these MIR predictions, as assessed by the coefficient of determination (R2) in a validation population (Table 1), ranged from 0.68 to 0.90, with the exception of Na (0.44).

Table 1 Mean, standard deviation (SD), and accuracy (R2), estimated by cross-validation, of mid-infrared (MIR) predictions for concentrations of minerals and citrate in milk from Montbéliarde cows.

Heritability and genetic correlation estimates for mineral and citrate content

Genetic parameters, i.e. heritabilities (h2) and genetic correlations (rg), were estimated for milk mineral and citrate content, as predicted from more than 1 million test-day records from 126,873 cows (Table 2). At the test-day level, heritability estimates were moderate for Na content (h2 = 0.32) but higher for other minerals (h2 = 0.50 to 0.56) and citrate (h2 = 0.48). Na was negatively and poorly correlated with other minerals (-0.23 ≤ rg ≤  − 0.02) and citrate (rg =  − 0.15), while the levels of most other minerals were generally positively correlated (0.11 ≤ rg ≤ 0.60); the exception was the relationship between Ca and K (rg =  − 0.22). Values of the genetic correlation between citrate and mineral levels in milk ranged quite broadly, depending on the mineral: null with K (rg = 0.01 with K), slightly negative with Na (rg =  − 0.15) and P (rg =  − 0.16), and highly positive with Ca (rg = 0.57) and Mg (rg = 0.59).

Table 2 Estimates of heritability (in bold, diagonal) and genetic correlation (above the diagonal) for the concentrations of minerals and citrate in milk (SE < 0.01).

QTL identified in GWAS

We conducted single-trait GWASs on imputed WGSs from 19,586 cows (12,907,802 variants) and primarily identified 96 trait x region combinations with significant effects (-log10(P) ≥ 8.4) on milk mineral and citrate composition by applying the procedure described in the Methods section (Fig. 1). In two genomic regions (a 12Mbp-region on BTA1 and a 27Mbp-region on BTA20), the single marker analyses identified multiple trait x region combinations (10 and 15, respectively). Therefore, conditional analyses including the most significant variant identified in each region was applied to decipher if significant variants were due to linkage disequilibrium (LD) with the same causal mutation or to the presence of multiple causal mutations. These analyses led to the exclusion of 4 and 9 trait x region combinations on BTA1 and BTA20, respectively, resulting in 83 independent QTL. These QTL corresponded to 50 genomic regions located on Bos taurus (BTA) autosomes 1, 2, 3, 4, 5, 6, and 7 (Table 3) and 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 24, 25, 27, 28, and 29 (Table 4). Of these, 18 regions had effects on several traits (2 to 5), while 32 affected only one trait. The four regions with the most significant effects (-log10(P) ≥ 50) each affected four to five traits; these were located on BTA 1 (~ 142.8 Mbp, for P, K, Mg, and Na; Fig. 2), 20 (~ 58.2 Mbp, for citrate, Mg, K, and Ca; Fig. 2), 6 (~ 45.3 Mbp, for K, Ca, P, and Na), and 5 (~ 116.4 Mbp, for Ca, Mg, P, citrate, and Na). Ten other regions with smaller but still highly significant effects (20 ≤  − log10(P) < 50) were located on BTA 5 (~ 119.0 Mbp), 6 (~ 43.3, 48.1, and 85.6 Mbp), 11 (~ 103.2 Mbp), 17 (~ 50.8 Mbp), 19 (~ 60.6 Mbp), 20 (~ 55.9 and 60.8 Mbp), and 22 (~ 32.8 Mbp); these also generally affected multiple milk components. All remaining significant QTL regions (8.4 ≤  − log10(P) < 20) were spread across 22 different autosomes. Overall, 6 to 21 different QTL were identified for each trait, and their cumulative effects explained from 19.8% to 45.5% of the genetic variance of a given trait (Table 5). The most significant QTL, located on BTA1 and BTA20, explained more than 23% of the genetic variance of P and citrate, respectively.

Figure 1
figure 1

–log10(P) value of the effect of variants on milk mineral (Ca, P, Mg, K, and Na) or citrate content plotted against their position on Bos taurus autosomes.

Table 3 QTL identified for milk mineral and citrate content on Bos taurus (BTA) autosomes 1 to 10 (when a QTL region affected multiple traits, the most significantly affected trait is indicated in bold).
Table 4 QTL identified for milk mineral and citrate content on Bos taurus (BTA) autosomes 11 to 29 (when a QTL affected multiple traits, the most significantly affected trait is indicated in bold).
Figure 2
figure 2

–log10(P) values of the effects of variants on milk mineral (Ca, P, Mg, K, and Na) or citrate content plotted against their position on Bos taurus autosomes 1 and 20.

Table 5 Number of QTL and the percentage of genetic variance explained for each trait.

Annotation of GWAS peaks

The sizes of the confidence intervals (CI) of QTL ranged from 1 bp to 3.3 Mbp; each CI contained between 1 and 1004 variants with significant effects (99.3 on average). While only ca. 6% of the tested variants were included on SNP chips, these chip variants (mainly HD) were disproportionately represented among the variants ranked in the top 10 in QTL peaks, and accounted for nearly 43% of the variants ranked first in the peaks, i.e. “top 1” variants (Fig. 3). For less than one-third of the QTL detected (24/83), the variant with the most significant effect was not located in a gene. Depending on the set of significant variants under consideration (all, top 100, top 50, top 10, or top 1), between 25.1% (top 1 variants) and 28.8% (all variants in the CI of the QTL) of variants were intergenic, while 60.5% (top 1) to 65.7% (top 10) of variants were located in genes, including upstream and downstream regions (Table 6). In all QTL, we therefore identified at least one positional candidate gene (1 to 20 per QTL, 5.3 on average). In total, 271 different genes were found within the CIs of the 83 QTL. In the four QTL regions with the most significant effects, the top-ranked variants were located in the genes SLC37A1 (BTA1), ANKH (BTA20), SEL1L3 (BTA6), and PPARA (BTA5); and the top-ranked variants in the remaining QTL were located in dozens of other genes (Tables 3 and 4). The majority of the variants located in genes were intronic, and were only rarely found in coding regions: only 1%, approximately, of variants were non-synonymous, i.e. 84 variants within the CIs of the QTL, and 55, 24, 4, and 1 among the top 100, top 50, top 10 and top 1 variants, respectively (Table 6). Out of these non-synonymous variants located in coding regions of genes, 4 are serious candidates because they were found among the 10 most significant variants of the QTL peaks: G446S in SLC26A4 (ranked 2nd for the QTL found on BTA4 at 48.7Mbp for Ca), V298M in ENSBTAG00000050954 (ranked 10th for the QTL found on BTA7 at 40.3Mbp for P), F267V in ARHGAP39 (ranked 5th for the QTL found on BTA14 at 0.4Mbp for K and Citrate) and F257Y in GHR (ranked 1st for the QTL found on BTA20 at 31.9Mbp for P, Ca and Mg). We also found a relatively high proportion of variants located in non-coding RNA (7.9% of all variants in the CIs of the QTL and 13.6% of top 1 variants). As presented in Table 6, these variants were mostly located in long non-coding RNA (lncRNA), less frequently in micro RNA (miRNA) and small nuclear RNA (snRNA), and very rarely in small nucleolar RNA (small Cajal body-specific RNA, scaRNA), and transfer RNA (tRNA).

Figure 3
figure 3

Percentages of chip variants (50 K, HD, or “research” SNPs, in green) and imputed variants (in blue) found among the top 50 variants of QTL peaks. Dashed lines represent the total percentage of chip variants (6%, in green) and imputed variants (94%, in blue) of all variants that were tested in the GWAS.

Table 6 Annotation of variants located within confidence intervals of the QTL (a single variant could have different annotations).

We then completed annotation of the QTL (1) identifying putative regulatory SNPs (rSNPs) located within transcription factor binding sites (TFBSs) according to the procedure described in the Methods section, and (2) using transcriptomic data available on the Cattle Gene Atlas website, http://cattlegeneatlas.roslin.ed.ac.uk/.

In the CIs of 40 of the 83 QTL, we identified 184 variants that overlapped with the TFBSs of 438 transcription factors (TFs), i.e. putative rSNPs (see Supplementary Table S1). These 438 TFs had 67 unique target genes among the 271 positional candidates identified in the QTL. Each TF targeted from 1 to 5 positional candidate genes, and 110 TFs had binding sites (BSs) at multiple loci. One TF (ASCL2) targeted 5 genes (BHLHE23, DGKQ, RMND5B, bta-let-7a-3, and bta-mir-2443), while 2, 21, and 85 TFs targeted 4, 3, and 2 genes, respectively. Two top 1 variants, rs108972810 (~ 45.4 Mbp in the upstream region of SMIM20 on BTA6, for Na) and rs209051255 (~ 41.3 Mbp in the upstream region of ENSBTAG00000053872 on BTA7, for Ca) were in the TFBSs for the TFs DOF and FOXD3, respectively. In total, for 22 QTL, top 10 variants were found in the TFBSs of 19 different target genes (including CSN2, PAEP, GPAT4, BRI3BP, and PPARA), which are potentially regulated by 47 TFs.

We then assessed the degree to which these positional candidate genes and TFs demonstrated tissue-specific expression. Of the 271 genes and 438 TFs associated with milk-mineral QTL, 203 and 160, respectively, were present in the Cattle Gene Atlas, which contains expression data from 91 different tissues or cell lines. Using the model described in the Methods section, we assessed the overexpression or tissue specificity of these genes by evaluating the t-statistic and the P value, Pt, associated with the effect of the tissue under consideration (see Supplementary Tables S2 and S3). Among the 203 positional candidate genes and 160 TFs with a putative regulatory effect on these genes, we calculated the number that were overexpressed in each tissue type (Pt < 10−4). The profiles obtained for both categories of genes, presented in Fig. 4, were quite similar. The tissues or cell types in which we found the highest number of genes associated with milk mineral composition were white blood cells (51 candidate genes and 59 TFs) and mammary gland (47 candidate genes and 45 TFs). In each tissue, we identified the most-specific genes by retaining the top 10% with respect to t-statistic values, i.e. 20 candidate genes and 16 TFs. Among the candidate genes that were most specific to mammary gland, we found the genes encoding the main milk proteins (CSN1S1, CSN1S2, CSN2, CSN3, and PAEP), and SLC37A1 and ANKH, which are located within the QTL that had the most significant effects in the present study. The ASCL2 TF, which potentially regulates five genes located in the CIs of the QTL, was one of the most mammary-gland-specific TFs. Among genes specific to white blood cells, we found COTL1, MKL1, and FOXP1. SLC37A1 was not among the 20 most-specific genes but it was ranked 22nd. In the upstream or intronic regions of these four genes, we identified the top-ranked variant of four different QTL, located on BTA18, BTA5, BTA22, and BTA1, respectively. Furthermore, we identified 27 rSNPs for which both TF and target gene were highly specific to the mammary gland; these combinations were composed of 26 unique TFs and 15 unique target genes (see Supplementary Table S4).

Figure 4
figure 4

Counts of significantly expressed genes (P < 10−4) in 91 tissues, showing (a) positional candidate genes and (b) transcription factors identified as putative regulators of these genes.

Focus on the SLC37A1 and ANKH gene regions

As mentioned above, two QTL—located on BTA1 (142.8 Mbp) and BTA20 (58.2 Mbp) in the vicinity of the SLC37A1 and ANKH genes, respectively—had very strong effects on milk mineral and citrate content. The SLC37A1 region affected P, K, Mg, and Na levels while the ANKH region had effects on citrate, Mg, K, and Ca (Fig. 2). These two regions alone affected all six milk components analyzed in this study. They explained a high proportion of the genetic variance in P (23.3%), citrate (23.1%), Mg (17.9%, i.e. 7.3% for SLC37A1 and 10.6% for ANKH), and K (15.4%, i.e. 13.6% for SLC37A1 and 1.8% for ANKH), but had much more moderate effects on Na (2.6%) and Ca (2.2%).

In the SLC37A1 region, the variant with the most significant effect was located in an intron; it was an imputed variant (R2 = 0.85) for P and Mg (rs109459130 at 142,834,737 bp) and a chip variant (HD SNP) for K and Na (rs109717634 at 142,835,551 bp). For both variants, which were located 814 bp apart, the most frequent allele (0.58 and 0.61, respectively) increased the amount of all minerals affected by this region.

In the ANKH region, we identified three different top 1 variants, depending on the trait. For K, the top variant was rs134021638 (at 58,386,888 bp, imputed with R2 = 0.92), located in an intronic region of ANKH; for Ca, Mg and citrate, the top variants were located in an lncRNA ENSBTAG00000048498 (rs110048176 at 58,189,663 bp, HD SNP for Ca and Mg and rs109956167 at 58,204,929 bp, imputed variant with R2 = 0.90 for citrate). ENSBTAG00000048498 spans from 58,186,301 to 58,224,434 bp and is located around 83 kbp upstream of the ANKH gene (58,307,527–58,477,497 bp). In all cases, the alleles responsible for an increase in citrate and mineral content were not the most common, with a MAF of 0.05 for the ANKH variant (rs134021638) and 0.17 for the two other variants.

In both regions, we examined the top-ranked variant for the most-affected traits, i.e. rs109459130 and rs109956167, and tested the interaction effects of their genotypes on milk mineral and citrate content. The results, presented in Fig. 5, revealed no significant interaction for any of the traits analyzed (P ≥ 0.02), suggesting that the effects of the two regions on milk composition were additive.

Figure 5
figure 5

SLC37A1 and ANKH genotypes and genotype interaction effects on milk mineral (Ca, P, Mg, K, and Na) or citrate content.

Discussion

To the best of our knowledge, this study is unique because it is the first GWAS of imputed whole-genome sequences based on the most-recent bovine reference genome with such a large population of cattle (19,586 cows); and it is the first attempt to investigate milk mineral content with a sequence-based GWAS, which here assessed a very large panel of genomic polymorphisms (12.9 million).

Using this approach, we identified 83 QTL that explained a substantial part of the genetic variance (up to 42.2%) for mineral and citrate content in cows’ milk; in each QTL region, we then identified functional candidate genes and variants. Our results build on those of two previous studies that used GWAS to investigate the mineral composition of milk—based on mineral content measured with reference methods and HD 777 k SNP genotypes—in small populations of Holstein (3719 and 44412) or Jersey (3219) cows. The study of Buitenhuis et al.9 was dedicated to mineral composition while Kemper et al.12 attempted to identify QTL that overlapped between milk production and composition traits. Only two QTL were shared between the two studies: a region located on BTA1 that affected P, for which both studies identified SLC37A1 as the best functional candidate gene, and a region located near the DGAT1 gene on BTA14 that affected Ca and P. In our study, we also identified these two regions, together with dozens of other regions located throughout the bovine genome.

By considering a very large dataset (more than 1 million test-day records of 126,873 cows), we estimated heritability values for mineral and citrate composition to be moderate to high, i.e. higher5,7,8,10 than or similar6,9 to those previously reported in the literature. Furthermore, we found strong genetic correlations among Ca, P, and Mg and among Ca, Mg, and citrate, which was consistent with the studies of Toffanin et al.5 and Denholm et al.8. Among minerals, Na had both the lowest heritability value (0.32 vs 0.48–0.56 for the other minerals) and the smallest percentage of genetic variance explained by its associated QTL (19.8% vs 33.1–42.2% for the other minerals). Ca, P, and Mg shared more QTL than any other group of minerals: 7, 6, and 6 QTL were shared between Ca and P, Ca and Mg, and P and Mg, respectively, while 4 QTL had effects on all three. In contrast, despite the high degree of genetic correlation between Ca, Mg, and citrate, only 2 QTL were shared among these three traits; however, one of these was a QTL located on BTA20 (~ 58.2 Mbp) that explained 2.2%, 10.6%, and 23.1% of the genetic variance of Ca, Mg, and citrate, respectively. Overall, estimates of genetic correlations were consistent with QTL results, which probably reflects the common biological pathway of these minerals in milk. In colloidal form, Ca, P, and Mg are associated with caseins in the micelles while in the soluble fraction, Ca and Mg are associated with citrate. Further studies, using random regression models fitted across lactation, could investigate the pattern of genetic relationships between the different milk minerals during lactation and thus provide a better understanding of the underlying biological mechanisms.

The resolution of our study was high enough to identify a single or a few positional candidate genes in most of the 83 QTL we identified. In each of these regions, though, the variant with the most significant effect was not necessarily located in a gene. As an example, we detected the overrepresentation of chip SNPs at the top of the QTL peaks (42.7% of the top 1 variants), and the majority of these SNPs were located in intergenic regions. Chip SNPs are directly genotyped or have a higher imputation accuracy than the surrounding variants, which probably enabled more precise estimation of their effects. To identify the best candidate variant(s) for each QTL, we therefore considered not only the variant with the most significant effect, but the set of variants ranked at the top of the peak that were located in or close to genes. In most cases, we found that these variants were located in non-coding regions of genes; for example, 15.3% and 12.1% of all top 10 variants were located in upstream regions of genes or in lncRNA, respectively, i.e. in putative regulatory regions. To further refine our results, we consulted in silico annotations of rSNPs in upstream regions of genes as well as existing knowledge regarding genomic regions that are transcribed into non-coding RNA. Moreover, to support the putative roles of these variants, we also considered the tissue specificity of candidate genes and transcription factors. Unfortunately, the expression dataset, which was based on Ensembl release 94, did not contain all of the genes located in QTL regions. Here we present some examples of QTL in which the putative causal mutation was located in a regulatory region, with particular attention to functional candidate genes previously associated with milk composition and the QTL with the most significant effects in this study.

GPAT4 (glycerol-3-phosphate acyltransferase 4) was the best candidate gene for the QTL on BTA 27 with effects on Mg. Although it was not the most notable QTL in terms of its effect on minerals, this region, and this gene in particular, have been highlighted by previous studies as affecting milk composition (fat, protein, and lactose content)19,22,23,24. In these studies, five variants in linkage disequilibrium—four in the upstream region and one in the 5′ UTR region of GPAT4—were highlighted as the best candidate causal variants. In our study, these variants were ranked in the top 5 in the GWAS peak; the variants ranked 3rd (rs209479876) and 5th (rs209855549) alter the WRKY48 and TWI transcription factor binding sites (TFBSs), respectively, and are more likely to be the causal variants. Daetwyler et al.23 also highlighted rs209479876 as the best causal variant in this region because of the high probability that it overlaps a TFBS. We also confirmed that GPAT4 was overexpressed in the mammary glands (P = 3.10−12) compared to 90 other tissues or cell types, but expression data were not available for the TFs associated with this gene.

For the QTL identified on BTA11 at ~ 103.2 Mbp with effects on P and citrate, the top two variants in the peak for P (the most-affected trait) were intergenic. Instead, the variants ranked 3rd to 7th were located in the upstream region of the PAEP (progestogen-associated endometrial protein) gene; the 5th-ranked variant, rs110710904, probably alters a binding site of the Macho-1 transcription factor (P = 0.02). PAEP, which is one of the most mammary-gland-specific genes (Pt = 1.8.10−37), encodes β-lactoglobulin, the most abundant whey protein in cow milk. Two non-synonymous variants in this gene, rs109625649 and rs110066229, were previously highlighted as the causal mutations underlying β-lactoglobulin concentration in milk25. In our study, these were respectively ranked 157th and 171st in the peak, i.e. far below the top-ranked variants upstream. These results corroborate previous reports that these two putatively causative missense mutations did not explain all the effects of the region on milk composition19,20 and highlight an rSNP as a likely causative variant.

BRI3BP (BRI3 binding protein), located on BTA 17 at ~ 50.8 Mbp, has been previously associated with de novo short chain fatty acid synthesis in bovine milk26,27. In our study, this region appeared to have effects on the K content of milk, and we found that BRI3BP was significantly overexpressed in mammary gland compared to the 90 other tissues investigated (P = 3.9.10−8). The first seven variants in the peak were located in an intronic region of BRI3BP. The 8th (rs477456528) and 9th (rs440703666) variants, instead, were located in the upstream region of the gene and rs440703666 probably alter the binding site of the transcription factor SPL8. This variant thus represents a very attractive functional candidate in this region.

In the four regions with the most significant effects on milk minerals, we identified SLC37A1 (BTA 1), ANKH (BTA 20), SEL1L3 / SMIM20 (BTA 6), and PPARA (BTA 5) as the best candidate genes.

Five of the six traits analyzed in this study were affected by a QTL region at ~ 116.4 Mbp on BTA5. Here, a variant located at 116,438,773 bp (1000G_80994138) ranked 2nd for four traits. This variant was located in the upstream region of PPARA (peroxisome proliferator activated receptor alpha) and overlapped the binding sites for the transcription factors E2F4, RSC3, RSC30, TDA9, and TFDP1. In this QTL region, top 1 variants were located either in an intronic region of PPARA (citrate, Mg, and P) or in an lncRNA, LOC101903383 (Ca and Na). PPARA encodes the PPAR-α transcription factor, a key regulator of lipid metabolism belonging to the superfamily of PPAR hormone receptors28; expression of this TF in mammary gland was previously found to be associated with milk fatty-acid composition in dairy cows29,30. Here, we did not detect any mammary-gland specificity of PPARA but we did observe that this gene was expressed in this tissue. It is therefore possible that polymorphisms in the binding sites of this TF or in the lncRNA LOC101903383, located 1.3 kbp upstream of PPARA, could regulate the expression of PPARA and be responsible for the strong effects of this region on milk composition.

On BTA6 at ~ 45.3 Mbp, we identified a QTL with very strong significant effects on K and to a lesser extent on P, Na, and Ca. Within the confidence interval of this QTL, we found 80 different potential variants, of which the majority (69) were intergenic, 9 were located in an intronic region of SEL1L3 (SEL1L family member 3), and 2 were in the upstream region of SMIM20 (small integral membrane protein 20). However, when we examined only the top 10 variants for the different traits, the number of distinct variants shrank to 18 (9 intergenic, 7 in SEL1L3, and 2 in SMIM20). Of these, two (rs108972810 at 45,401,485 bp and rs136498639 at 45,401,570 bp) were located in the upstream region of SMIM20 and overlapped a TFBS. SEL1L3 was previously identified as a candidate gene in a QTL region with effects on bovine milk protein composition19. Instead, to the best of our knowledge, this study is the first to propose SMIM20 as a candidate gene for milk composition. However, the functional link between these genes and milk composition has yet to be established, as we found both genes to be endometrium-specific and underexpressed in mammary glands compared to the 90 other tissues investigated.

For milk mineral composition, the two best functional candidate genes highlighted by our analysis were SLC37A1 (solute carrier family 37, member A1) and ANKH (inorganic pyrophosphate transport regulator). These two genes were previously found to be overexpressed in mammary glands relative to 17 other types of tissue31, and here we found both among the top 10% of mammary-gland-specific genes (Pt = 3.4.10−26 and 6.6.10−16, respectively). This suggests that the main function of these genes occurs in epithelial cells of this tissue. SLC37A1 and ANKH both encode transmembrane proteins involved in ion transport and have been found to play a role in inorganic anion transport20. Variants located in both genes explained a large degree of the genetic variation in milk mineral content. Earlier studies also linked SLC37A1 with mineral content, in particular that of P9,12, while both genes have been implicated in determining milk protein composition in Holstein, Montbéliarde, and Normande cows19 and cheese-making traits in the same Montbéliarde cows we analyzed here20. These results are consistent with what is known about the association of minerals with casein molecules in micelles; milk mineral composition is strongly related to milk protein composition2 and therefore to cheese-making traits3,4.

In SLC37A1, 7, 31, 9, and 7 variants were located in the confidence intervals of QTL for P, K, Mg, and Na, respectively. These variants (32 of which were unique) were overwhelmingly located in introns of the gene, in a 12.3-kbp-region from 142,826,156 to 142,838,477 bp. Among these 32 variants, we were not able to distinguish the best candidate based on annotation, but two variants, 814 bp apart, were particularly highly ranked for all traits. Specifically, rs109459130, at 142,834,737 bp, was ranked 1st, 2nd, 1st, and 3rd for P, K, Mg, and Na, respectively, while rs109717634, at 142,835,551 bp, was ranked 2nd, 1st, 3rd, and 1st for P, K, Mg, and Na, respectively. Of these, the first (rs109459130) appeared to be the better candidate because it was top-ranked for the most affected trait, P, and because it was imputed (R2 = 0.85), while the second variant was an HD SNP (R2 = 0.997).

In the region of the ANKH gene, we found 81, 65, 69, and 21 variants in the confidence intervals of QTL for citrate, Mg, K, and Ca, respectively. These represented 82 unique variants, all located either in intronic regions of the ANKH gene (59 variants between 58,344,839 and 58,388,849 bp) or in the lncRNA ENSBTAG00000048498 (23 variants between 58,185,895 and 58,212,187 bp). The same variant was ranked first for both Mg and Ca. Two top-1 variants were located in the lncRNA ENSBTAG00000048498—rs109956167 (at 58,204,929 bp) for citrate and rs110048176 (at 58,189,663 bp) for Mg and Ca—while the top-ranked variant for K was located in ANKH (rs134021638 at 58,386,888 bp). We propose rs109956167 as the best candidate causal variant because i) it was ranked 1st for the most-affected trait (citrate), ii) it was imputed, in contrast to rs110048176 which was from the HD chip, and iii) it was the only variant present in the top 10 for all traits affected by this region (4th for Mg, 6th for Ca, and 9th for K). We hypothesize that the lncRNA ENSBTAG00000048498, which is 83 kbp downstream of the closest gene, ANKH, may affect milk mineral content by regulating the expression of ANKH and thus the amount of protein available for ion transport.

When we examined the best candidate variants in the regions of SLC37A1 (rs109459130) and ANKH (rs109956167), we found that alleles of these genes did not have significant interactions with respect to the six milk composition traits analyzed in this study and that, combined together, these two regions explained a large part of the genetic variance in milk mineral content. The allele that increased mineral content was the most frequent allele of SLC37A1 (allele A, frequency = 0.58), while it was the least frequent allele in the ANKH region (allele C, frequency = 0.17).

Compared to previous studies, our GWAS-based investigation of milk mineral composition was conducted at the whole-genome level, after imputation with the most recent reference genome and using data from a large population of animals. This approach led to the identification of a large number of genomic regions that explain a great deal of the genetic variance of the traits studied here. Moreover, post-GWAS investigations conducted using different annotation datasets enabled the identification of candidate causal genes and the prioritization of candidate variants in these genomic regions. In most situations, exonic variants were not proposed as the best candidates, although they cannot be excluded in four QTL regions, including GHR. In contrast, the best candidate variants often were putative regulatory variants. Although the functional causative effect of the candidate variants remains to be demonstrated, this study highlights a ‘short list’ of candidate variants, in particular rs109459130 (SLC37A1) and rs109956167 (ANKH), whose favorable alleles can be feasibly selected to increase the mineral content (Ca, P, Mg, and K) of milk and therefore improve both its nutritional and cheese-making qualities in Montbéliarde cows.

Methods

Ethics statement

Milk samples were analyzed with MIR spectrometry during routine milk recording in commercial herds of Montbéliarde cows in the Franche-Comté region (France). We did not perform any experiments on animals and no ethical approval was required.

Animals and phenotypes

The original dataset was generated by the From’MIR project and is described in detail by Sanchez et al.11,20. It comprised 6,670,769 mid-infrared (MIR) spectra of milk samples from 410,622 Montbéliarde cows. The concentrations of five milk minerals (Ca, P, Mg, K, and Na) and citrate were predicted from MIR spectra using equations developed by the Optimir project14,15 (Table 1). As previously described11,20, to ensure that the dataset was homogeneous (and less computationally demanding), we retained only the first-lactation records with at least seven test-day records per cow for estimating variance components (1,100,238 test-day records from 126,873 cows11). To estimate environmental effects used to derive the phenotypes included in GWAS, at least three test-day records per cow were required, corresponding to a dataset of 1,442,371 test-day records from 189,817 cows20.

Estimation of genetic parameters

First lactation data were analyzed using bivariate repeatability animal models applied to all pairs of traits. (Co)variance components were estimated using the AI-REML algorithm as implemented in Wombat software32, with the following linear animal model:

$$ {\mathbf{y}} = {\mathbf{X\beta }} + {\mathbf{Za}} + {\mathbf{Wp}} + {\mathbf{e}}_{1} , $$
(1)

where \({\mathbf{y}}\) is the vector of test-day observations; a ~ \({\mathbf{N}}\left( {0,{\mathbf{A}} \otimes {\mathbf{G}}_{0} } \right)\) is the vector of random additive genetic effects, p ~ \({\mathbf{N}}\left( {0,{\mathbf{I}} \otimes {\mathbf{P}}_{0} } \right)\) is the vector of random permanent environmental effects, and \({\varvec{e}}_{1}\) ~ \({\mathbf{N}}\left( {0,{\mathbf{I}} \otimes {\mathbf{R}}_{0} } \right)\) is the vector of random residual effects.X, Z, and W are incidence matrices, A is the relationship matrix among individuals calculated from the pedigree (traced back over four generations and containing 315,661 animals), and I the identity matrix. \({\mathbf{G}}_{0}\), \({\mathbf{P}}_{0}\) and \({\mathbf{R}}_{0}\) are 2 × 2 matrices of additive genetic, permanent environmental, and residual variances-covariances, respectively. The \({{\varvec{\upbeta}}}\) vector included the fixed effects of the herd x test-day x spectrometer combination, stage of lactation, and season of calving.

Genotypes and imputation to whole-genome sequences

A subset of 19,586 cows for which MIR spectra were available had been genotyped for the purpose of genomic selection using the BovineSNP50 (50 K, 6505 cows) or the EuroG10K BeadChip (Illumina Inc., San Diego, 13,081 cows); the latter contains generic SNPs and a research add-on for causal or predictive SNPs for traits of interest in cattle. Missing genotypes were imputed using FImpute software33 for the 53,469 autosomal SNPs (50 K and “research” SNPs, 50 K+) that passed all quality control filters (individual call rate > 95%; SNP call rate > 90%; minor allele frequency (MAF) > 1% in at least one major French dairy cattle breed; genotype frequencies in HW equilibrium with P > 10−4). Whole-genome sequences (WGSs) were then imputed in two steps. In the first step, 777 K high-density and “research” SNPs (HD+) were imputed with FImpute software using a within-breed reference set of 522 Montbéliarde bulls genotyped with the Illumina BovineHD BeadChip (Illumina Inc., San Diego, CA)34. Finally, allele dosages of the WGSs were imputed using Minimac software35 and WGS variants of 1479 Bos taurus animals from the 7th run of the 1000 Bull Genomes Project, representing 17 cattle breeds and including 63 Montbéliarde bulls. This 2-step strategy was found to be the most accurate36 (eg, Boowman & Veerkamp, 2014) because (1) the first step takes advantage of the high number of HD genotypes of influential bulls and is performed with very limited loss in accuracy34 (Hozé et al., 2013) ; (2) the number of sequenced Montbéliarde bulls was too limited for an accurate imputation to sequence when used alone; and (3) linkage disequilibrium is partially conserved across breeds at the HD level (~ several kb), making use of close breeds sequence data very beneficial. WGS variants were selected following the protocol defined by the 1000 Bull Genomes consortium16,23, as described in Boussaha et al.37. Short reads were filtered for quality and aligned to the ARS-UCD1.2 reference sequence17, and small genomic variations (SNPs and InDels) were detected using SAMtools 0.0.1838. Raw variants were then filtered to produce a dataset of 25,050,323 variants. The precision of imputation from HD+ to WGS was assessed using R2 values calculated with Minimac software35. Only variants with R2 ≥ 0.20 and MAF ≥ 0.005 were retained for association analyses, i.e. 12,907,802 variants, with a mean R2 of 0.67 and MAF of 0.19.

GWAS

We performed single-trait association analyses between all 12,907,802 polymorphic variants and each of six milk mineral and citrate traits, described in Table 1. All association analyses were performed using the mlma option of GCTA software (version 1.24), which applies a mixed linear model that includes the variant to be tested39:

$$ {\mathbf{yd}}{ } ={\mathbf{1}}m + { }{\mathbf{x}}_{{\mathbf{v}}} {\text{b}}_{{\text{v}}} + { }{\mathbf{u}}{ } + { }{{\varvec{\upvarepsilon}}}, $$
(2)

where \({\mathbf{yd}}\) is the vector of so-called yield deviations, i.e. test-day records adjusted for non-genetic effects with the mixed linear model (1) using Genekit software40 and averaged per cow; \(m\) is the overall mean; \({\text{b}}_{{\text{v}}}\) is the additive fixed effect of the variant to be tested for association; \({\mathbf{x}}_{{\mathbf{v}}}\) is the vector of imputed allele dosages, ranging from 0 to 2; \({\varvec{u}} \sim N\left( {0, \user2{G\sigma }_{u}^{2} } \right)\) is the vector of random polygenic effects, with \({\varvec{G}}\) the genomic relationship matrix (GRM) calculated using the HD SNP genotypes (which offer both high density and accuracy of imputation), and \(\sigma_{u}^{2}\) the polygenic variance, estimated based on the null model \(\left( {{\mathbf{yd}} = {\mathbf{1}}m + {\mathbf{u}} + {\mathbf{e}}} \right)\) and then assumed as known while testing for the association between each variant and the trait of interest; and \({\varvec{\varepsilon}} \sim N\left( {0, I{\varvec{\sigma}}_{{\varvec{\varepsilon}}}^{2} } \right)\) is the vector of random residual effects, with I the identity matrix and \( \sigma^{2}\) the residual variance. Association was tested using a t-statistic calculated by dividing the variant effect estimate by its standard error.

In order to correct for multiple testing, the Bonferroni correction was applied to take into account all 12.9 million independent tests. The 5% genome-wide threshold of significance therefore corresponded to a nominal P-value of 4 × 10−9 (-log10(P) = 8.4) per test. When a given trait was significantly affected by multiple variants, variants that were located less than 2 million base-pairs (Mbp) apart were grouped together to define QTL, considering the variants belonged to the same QTL region. The bounds of the confidence intervals (CIs) of each region were then determined based on the positions of variants that were included in the upper third of the QTL peak.

In regions where multiple neighboring QTL were identified (BTA1 and BTA20), conditional analyses were carried out using the cojo option of GCTA39 in order to conclude if multiple significant variants in a genomic region were due to LD with the same causal mutation or to the presence of multiple causal mutations. Association analyses were performed by including in the model the most significant variant as a fixed effect and by testing all variants in these neighboring QTL that were not in strong LD with the conditional variant (r2 < 0.9).

Functional annotations

Genomic regions and variants were annotated with FAANGMine v1.1 (https://faangmine.elsiklab.missouri.edu/), developed by the Functional Annotation of ANimal Genomes initiative18 and which integrates the ARS-UCD1.2 bovine reference genome with a variety of external data sources, including RefSeq from NCBI (https://www.ncbi.nlm.nih.gov) and Ensembl (https://www.ensembl.org) gene sets.

The ability of genetic variants to alter transcription factor binding sites (TFBSs) was predicted with a custom script that used TFBS models from the JASPAR (JASPAR CORE 2018 collection41), HOCOMOCO (version v1042), and TRANSFAC (version v3.2 public43) databases. These databases contain curated sets of transcription factor binding models represented as Position Weight Matrices (PWM), which are derived from published collections of experimentally defined eukaryote TFBSs. Only vertebrate PWMs were downloaded for use in our study.

Gene overexpression or specificity in different tissues was determined using gene expression patterns of 24,616 genes (Ensembl release 94) available in the Cattle Gene Atlas, which contains 723 RNA-seq datasets representing 91 tissues and cell types, classified into 17 biological categories (http://cattlegeneatlas.roslin.ed.ac.uk/). To assess the expression specificity of each gene in a given type of tissue (by excluding tissues in the same biological category), we applied the following linear model as described in Fang et al.44:

$$ {\mathbf{ye}}{ } = {\mathbf{1}}me + { }{\mathbf{x}}_{{\mathbf{t}}} {\text{b}}_{{\text{t}}} + { }{\mathbf{zc}}{ } + { }{\mathbf{ee}} $$
(3)

where \({\mathbf{ye}}\) is the vector of expression level in the tissues, assessed by the scaled log2FPKM (Fragments Per Kilobase per Million mapped reads); \(me\) is the overall mean; \({\mathbf{x}}_{{\mathbf{t}}}\) is the vector of the variable with value 1 for samples of the tested tissue and − 1 for samples outside the same category; \({\text{b}}_{{\text{t}}}\) is the corresponding tissue effect; \({\mathbf{z}}\) is the incidence matrix related to the corresponding covariables effects \({\mathbf{c}}\), including age, sex and study effects; \({\mathbf{ee}}\) is the residual effect. Model (3) was implemented adapting R scripts available on the Cattle Gene Atlas website. For each gene, a t-statistic was computed by dividing the tissue effect by its standard error. A gene was considered to be overexpressed in a tissue if the probability associated with the t-statistic (Pt) was lower than 10−4. To determine the tissue specificity of genes, we ranked the genes in each type of tissue by their t-statistics; the top 10% were considered to be tissue-specific.

SLC37A1-ANKH genotype interactions

We tested putative interaction effects between the genes SLC37A1 and ANKH on milk mineral and citrate content with the following mixed linear model:

$$ {\mathbf{yd}} = {\mathbf{1}}mi + { }{\mathbf{Mg}}_{{{\text{SLC}}37{\text{A}}1}} { } + { }{\mathbf{Ng}}_{{{\text{ANKH}}}} + {\mathbf{Og}}_{{{\text{SLC}}37{\text{A}}1}} {\text{ x }}{\mathbf{g}}_{{{\text{ANKH}}}} + { }{\mathbf{Ps}}{ } + { }{\mathbf{ei}}, $$
(4)

where \({\mathbf{yd}}\) as defined in (2); \({\text{mi}}\) is the overall mean; \({\text{g}}_{{{\text{SLC}}37{\text{A}}1}}\) and \({\text{g}}_{{{\text{ANKH}}}}\) are the fixed effects of the genotypes of the best candidate variant in the SLC37A1 and ANKH regions, respectively; \({\text{g}}_{{{\text{SLC}}37{\text{A}}1}} {\text{ x g}}_{{{\text{ANKH}}}}\) represents the interaction between the two genotypes;\(\user2{ }{\mathbf{M}}, {\mathbf{N}}\), and \({\mathbf{O}}\) are incidence matrices related to the individual effects of the SLC37A1 and ANKH genotypes and their interaction, respectively; \({\mathbf{s}}\) is the vector of random sire effects and P the corresponding incidence matrix; and \({\mathbf{ei}}\) is the vector of random residual effects. All effects were tested using t-statistics computed in the MIXED procedure of SAS software.