Introduction

The implementation of genetic improvement programs in domestic populations to enhance desirable production traits, combined with the natural adaptation to new environments, has led to a variety of phenotypic variation between domestic populations of plants and animals1 as well as a strong differentiation between wild and domestic populations. Both natural and artificial selection in domestic species have left footprints across the genome, known as selection signatures, which can point to regions harboring functionally important sequences of DNA2,3. Theoretically, a beneficial variant that has been under selection can show long-range linkage disequilibrium (LD) and high frequency over a long period of time4,5. Therefore, these patterns allow us to detect such selection signatures in genomes. When successful, selection signature identification can provide novel insights into mechanisms that create diversity across populations and contribute to mapping of genomic regions underlying selected traits or phenotypes6,7. Approaches for detecting selection signatures rely on scanning variants across the genome of interest, searching for regions in which (i) the allele frequency spectrum is shifted towards extreme (high or low) frequencies; (ii) there is an excess of homozygous genotypes; (iii) there are long haplotypes with high frequencies and (iv) there is an extreme differentiation among populations. Several statistical methods has been developed to search for selection signatures, such as extended haplotype homozygosity (EHH)8, integrated haplotype score (iHS)9, hapFLK10, Cross Population Extended Haplotype Homozogysity (XP-EHH)11, the composite likelihood ratio (CLR); runs of homozygosity (ROH)12 and FST statistics13.

In fishes, domestication is recent in comparison with other land animals14, although some evidence of fish farming dates back approximately 3,500 years ago15. An exponential development of aquaculture has occurred since 1960, relying on the domestication of a handful aquatic species14. Salmonid species in particular have seen very intensive production increases over the last four decades. Currently, the most important farmed salmonids species in the word are Atlantic salmon (Salmo salar), rainbow trout (Oncorhynchus mykiss) and coho salmon (Oncorhynchus kisutch). Coho salmon belong to the Oncorhynchus genus and are naturally distributed across Pacific North American and Asian watersheds16. About 90% of farmed coho salmon production is based in Chile17. Farmed populations of coho salmon were established in Chile at the end the 1970′s, with the importation of ova from the Kitimat river (British Columbia) and the US state of Oregon18. During the 1990´s, a variety of breeding programs have been implemented for this species in Chile18, mainly focused on improving growth, disease resistance traits and flesh color18. The implementation of such breeding programs have likely, shaped the genetic diversity and haplotype structure of these populations, and their present genomes may contain traceable signatures of selection.

Genomic scans for detecting selection signatures have been successfully applied to several domestic animals, including aquaculture species. For instance, selection signature scans have been carried out for Atlantic salmon19,20,21,22,23,24, Nile tilapia25,26; and channel catfish27. To date, there are no studies focusing on the identification of selection signatures in farmed coho salmon populations. To increase the knowledge of domestication and selection effects on the genome diversity in farmed populations of this species, we searched for selection signatures in two groups comprising farmed individuals of coho salmon using different statistical approaches based in haplotype structure and allele frequency spectrum. Tracing signatures of selection in this species could provide further insights on genomic regions responsible for important aquaculture features of domestic Pacific salmon.

Methods

Populations

In this study we used samples representing two different lines that come from a breeding program of coho salmon maintained in Chile, hereinafter pupulations A (Pop-A) and B (Pop-B): we used 89 individuals for A and 43 individuals for B. These breeding populations were established in 1997 and 1998, belonging to even and odd spawning year, respectively28. According to previous analyses (Ben Koop, unpublished data), it is very likely that these two populations correspond to a mixture of the two original broodstocks (Kitimat river and Oregon). Both populations have been selected to improve growth rate for around eight generations by using Best Linear Unbiased Prediction29,30 with genetic gains ranging between 6 to 13% per generation31. Sampling protocols were performed in accordance with the Comité de Bioética Animal, Facultad de Ciencias Veterinarias y Pecuarias, Universidad de Chile, Chile (certificate No. 08-2015). The study was carried out in compliance with the ARRIVE guidelines.

Genotyping

Genomic DNA was extracted from fin clips of each individual and genotyped with a 200 K Thermo Fisher Scientific Axiom myDesign Custom Array developed by the EPIC4 genome consortium32 (http://www.sfu.ca/epic4/).

Genotype calling was performed using Axiom Analysis Suite v3.1 (Thermo Fisher Scientific) following the Axiom Analysis user guide. After filtering, including removing markers with no position on the coho salmon reference genome (GCF_002021735.1) and markers that were identified as problematic (OTV, Call Rate Below Threshold, Other), 167,486 SNP were kept32. In this study, we removed markers positioned in scaffolds, which were not placed within chromosomes; therefore 136,500 markers were kept for downstream analyses. All these markers and samples passed the quality criteria (missing call rate ≤ 0.1).

Genetic diversity, linkage disequilibrium (LD) and population structure

A common subset of 72,616 SNPs with MAF ≥ 0.05 and no deviation from Hardy–Weinberg equilibrium in populations A and B were explored to describe the genetic diversity of each population. Three parameters, including the number of SNPs with MAF ≥ 0.05; observed heterozygosity (HO), and expected heterozygosity (HE) were calculated using PLINK33.

We used the squared correlation coefficient between SNP pairs (r2) to measure LD34, which was calculated for all syntenic marker pairs. To enable a clear presentation of results showing LD in relation to physical distance between markers, SNP pairs were put into bins of 100 kb, and mean values of r2 were calculated for each bin. The mean r2 for each of the distance bins was then plotted against the distance bin range (Mb). This analysis was carried out on a chromosome-by-chromosome basis.

To visualize genetic differentiation between populations a Principal component analysis (PCA) was conducted using PLINK. In addition, ADMIXTURE35 was employed to analyze the population structure, which was run with 10,000 iterations using the correlated allele model with K value from 1 to 20 to choose the optimal number of clusters at the lowest cross-validation error. Results were plotted using R36.

Selection signatures

We combined three methods (XP-EHH, iHS and CLR) which have shown to have a power > 70% to detect selection signatures in comparison with other combinations of statistical tests5. Removing markers deviating from Hardy–Weinberg equilibrium might be obfuscating the important deviations such as those resulting from evolutionary selection37, therefore, we did not apply Hardy–Weinberg equilibrium filtering for selection signatures analyses. In addition, different MAF thresholds were used for each selection signatures test, which are described below.

  1. (1)

    XP-EHH. The XP-EHH (Cross Population Extended Haplotype Homozygosity) statistic compares the integrated extended haplotype homozygosity (EHH) profiles between two populations at the same SNP11. Therefore, the computation of EHH is required for each population. The XP-EHH statistic is then defined as ln(IPopA/IPopB), where IPopA is the integrated EHH for the population A and IPopB is the integrated EHH for population B. Negative XP-EHH scores suggest selection in B, whereas positive scores suggest selection acting in the A. A common subset of 89,715 SNPs with MAF > 0.01 was used for XP-EHH test, which was computed for each SNP using the R package rehh 3.1.138, with default options. As the XP-EHH values are normally distributed, a Z-test is applied to identify significant SNPs under selection. We used -log10(p value) = 3 (p value ≤ 0.001) as the threshold for considering XP-EHH score as significant evidence of selection and at least two SNPs ≤ 500 kb apart.

  2. (2)

    iHS. The iHS statistics was used to detect footprints of selection within the studied populations. This test is based on the standardized log ratio of integrals of the decay of the EHH, computed for both ancestral and derived alleles of the focal SNP. Selection induces hitchhiking (genetic draft), that leads to extended haplotypes for the selected allele and a slower fall-off of EHH on either side of the selected locus. Thus, a high iHS scoring SNP is typically associated with longer haplotypes and lower neighborhood diversity compared to the other SNPs39. Phased haplotypes were obtained using Beagle40. The ancestral and derived alleles of each SNP were inferred in two ways: (i) the SNP probes were compared with the genome of Atlantic salmon, which was used as an outgroup species; (ii) for SNPs that could not be compared, the ancestral allele were inferred as the most common allele in the total dataset, as suggested in other studies (e.g.41,42). SNPs with MAF < 0.05 were excluded from iHS analysis in accordance with9, therefore, we used 95,217 and 90,101 SNPs for populations A and B, respectively. The iHS score was computed for each autosomal SNP using the R package rehh 3.1.138, using default options. Similar to XP-EHH, iHS values are normally distributed, and a Z-test is applied to identify significant SNPs under selection. We also used -log10(p value) = 3 (p value ≤ 0.001) as the threshold for considering iHS score as significant evidence of selection and at least two SNPs ≤ 500 kb apart.

  3. (3)

    CLR: We implemented CLR test in the software SweeD V3.3.243, downloaded from https://cme.h-its.org/exelixis/web/software/sweed/. SweeD evaluates the variation in the site-frequency spectrum along the chromosome and implements the composite likelihood ratio (CLR) statistics44. CLR computes the ratio of the likelihood of a selective sweep at a given position to the likelihood of a null model without a selective sweep. Following other studies (e.i.45,46,47), SNPs with MAF < 0.05 were excluded from CLR analysis as well, therefore, the same subset of SNPs were used as for iHS. We calculated the CLR within each population and for each chromosome separately at grid points for every 20 kb. We used the 99.5th percentile of the distribution of CLR scores as threshold for the detection of outliers.

Candidate genes and functional analysis

A genomic region was considered as being under selection if it matched the criteria described above for iHS, XP-EHH and CLR. For each identified selective sweep region, we extended the region containing outlier scores by adding 250 kb to each end, to account for potential blocks of high linkage disequilibrium. Gene annotation was performed using GCF_002021735.1 coho salmon genome assembly (Accession PRJNA378663, Okis_V1; https://www.ncbi.nlm.nih.gov/assembly/GCF_002021735.1)48 from the NCBI Eukaryotic genome annotation pipeline.

The Database for Annotation, Visualization, and Integrated Discovery (DAVID) v6.8 tool49 was used to identify Gene Ontology (GO) terms and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways using a list of genes within significant regions based on iHS, XP-EHH, CLR values and the zebrafish annotation file as a reference genome.

Ethics approval and consent to participate

Coho salmon individuals and sampling procedures were approved by the Comité de Bioética Animal from the Facultad de Ciencias Veterinarias y Pecuarias, Universidad de Chile (certificate Nº No. 08-2015).

Consent for publication

Not applicable.

Results

Population B has a slightly greater HO = 0.39 (HE = 0.36) than Pop-A HO = 0.37 (HE = 0.36). The number of SNPs with MAF ≥ 0.05 was higher in Pop-A (95,217) than Pop-B (90,101). In addition LD presented a higher level in Pop-B (r 2= 0.14) than Pop-A (r2 = 0.10) (Fig. 1). LD decayed faster in Pop-A, reaching a level of 0.14 around 500 kb, while in Pop-B at the same distance LD reaches a value of 0.18. A similar pattern was observed for most of the chromosomes, except for, and Oki1, Oki21, Oki28, where LD decay was slightly higher in Pop-B and Oki-10, where both populations presented a similar pattern of LD decay (Supplementary Fig. S1).

Figure 1
figure 1

Decay of average linkage disequilibrium (r2) over distance across the two farmed populations. Different color lines represent populations: Pop-A = green, Pop-B = magenta.

Principal component analysis (PCA) was employed to explore the clustering of individuals from the two populations. The PCA revealed two distinct clusters corresponding to population A and B. Principal components (PCs) 1, 2 and 3 jointly accounted for 39.7% of total variance, with PC 1 (26.1%) separating A and B; PC 2 (7.6%) and PC 3 (6%) not clearly discriminating populations. However, Pop-B produced a much more heterogeneous group with individuals spread along both PC 2 and 3 (Fig. 2). ADMIXTURE analysis revealed that the lowest cross-validation error was reached at K = 10 (Supplementary Fig. S2 and S3), showing 10 ancestral lineages for these two populations (Fig. 3). Pop-A individuals show high degree of mixed ancestry, being dominated by several ancestral lineages proportions among individuals. In contrast, Pop-B individuals tend to be dominated by single ancestral lineages among individuals, the represented lineages are quite different, which is consistent with a greater dispersion of Pop-B individuals in the PCA in comparison with Pop-A.

Figure 2
figure 2

Principal components analysis (PCA) of genetic differentiation among individuals. Three first components are shown. Each point represents one individual, and different colors represent populations: Pop-A = green, Pop-B = magenta.

Figure 3
figure 3

Individual assignment probabilities generated with ADMIXTURE (K = 10). Each color represents a cluster, and the ratio of vertical lines is proportional to assignment probability of an individual to each cluster.

Genome-wide scanning for selection signatures

To detect signatures of selection in the two coho salmon breeding populations, we used three statistical methods. Two of them are based in extended haplotype homozygosity (EHH): iHS and XP-EHH, and one method is based on the variation of the site-frequency spectrum, the composite likelihood ratio (CLR) tests.

Inner circles of Figs. 4 and 5 show the genome-wide distribution of iHS values. The iHS statistics revealed 156 and 99 SNPs with significant iHS scores (p ≤ 0.001) representing 0.18% and 0.12% highest values in populations A and B respectively. We found 26 and 16 regions in these respective populations with at least two significant SNPs that are ≤ 500 kb apart. These genomic regions spanned 17.1 Mb across chromosomes Oki5, Oki6, Oki28 and Oki29 in Pop-A, and 7.5 Mb in Pop-B in Oki5, Oki14, Oki15, Oki17, Oki18, Oki22 and Oki26 (Table S1). The highest values were present in Oki5 and Oki28 for Pop-A, and Oki14 and Oki18 for Pop-B. XP-EHH revealed 43 SNPs, which surpassed the significance threshold (Figs. 4B, 5B), representing the 0.05% highest values, 41 with positive scores and indicating selection in Pop-A, and two with negative scores indicating selection in Pop-B (Table S2). According to the criteria that we defined, six genomic regions spanning 3.72 Mb were detected in Pop-A, and one region of 520.8 kb in Pop-B. Genomic regions with XP-EHH values above 3 were detected in Oki1, Oki10 and Oki28 for Pop-A. The highest XP-EHH estimate was on Oki1 (XP-EHH = 3.8030, -log10(p value) = 3.8448), and the lowest value was on Oki15 (XP-EHH = -3.373854, log10(p value) = 3.130043). As a complementary method to iHS and XP-EHH, we performed the composite likelihood ratio (CLR). Figures 4C and 5C illustrate the CLR statistics against the genome positions for coho salmon. Based on a conservative 99.9% outlier threshold (CLRA ≥ 4.81, CLRB ≥ 4.64), we detected 95 regions affected by selective sweeps, covering 28.24 Mb and 21.92 Mb in population A and B, respectively (Table S3). In both populations, the results provided evidence of selective sweeps in several chromosomes across the genome. In Pop-A the highest CLR values were located on Oki14 (CLR = 9.98) within a region of 180 kb and on Oki17 (CLR = 9.35) within a region of 400 kb. While, in Pop-B a clear evidence of selective sweep is observed on Oki15 (CLR = 10.75), with a cluster of extreme signals in a genomic region of 780 kb.

Figure 4
figure 4

Circos plot of the global distribution of selection signatures across the genome in Pop-A. The circles from inside to outside illustrate: (A) -log10(p-value) of iHS (Blue); (B) −log10(p value) of XP-EHH (Purple) and (C) CLR values (Green). Values for each test surpassing the threshold (iHS and XP-EHH ≥ 3, and CLR ≥ 4.81) are highlighted in red.

Figure 5
figure 5

Circos plot of the global distribution of selection signatures across the genome in Pop-B. The circles from inside to outside illustrate: (A) −log10(p value) of iHS (Blue); (B) −log10(p value) of XP-EHH (Purple) and (C) CLR values (Green). Values for each test surpassing the threshold (iHS and XP-EHH ≥ 3, and CLR ≥ 4.64) are highlighted in red.

Functional characterization of genomic regions

A total of 1,118 and 812 genes within the identified genomic regions putatively subjected to selection were retrieved for Pop-A and Pop-B, respectively. Gene annotation of selected regions revealed several functionally important genes, such as kdm6a involved in the regulation of anterior/posterior body axis development in zebrafish50 or sec24d and robo1 which were associated to resistance against Piscirickettsia salmonis in farmed coho salmon populations from Chile51. Genes in genomic regions with the highest scores for iHS, XP-EHH and CLR are shown in Tables 1, 2 and 3 while the complete list of the genes found is presented in Table S4.

Table 1 Genomic regions and genes spanning the five strongest detected selection signatures by iHS in each population.
Table 2 Genome regions and genes spanning the detected selection signatures by XP-EHH in each population.
Table 3 Genome regions and genes spanning the five detected selection signatures by CLR in each population.

We further investigated the functions associated with the putative genes undergoing positive selection by analyzing over-represented GO terms and pathways using DAVID. The identified GO terms and pathways are shown in Table S5. A total of 21 and 13 GO terms were observed for A and B, respectively. This includes 13 terms for biological process (BP), 3 for molecular function (MF), and 9 for cellular component (CC). The BP category is shown in Fig. 6; most of these terms are associated with basic metabolic processes, such as Developmental Process, Multicellular Organismal Process, Single-Organism Process, Response To Stimulus, Developmental Process, Presynaptic Process and Growth, among other. Additionally, 11 pathways were enriched (Table S5) such as Vascular smooth muscle contraction, Endocytosis, Regulation of actin cytoskeleton and TGF-beta signaling pathway.

Figure 6
figure 6

Functional annotation of candidate selective sweeps genes. Gene ontology (GO) analysis performed for Biological process. Color intensity of circle shows the significance of GO term and the size of the circle represents the number of genes associated with the GO term.

Discussion

Selective events are the major mechanism driving differentiation of populations. In this study, we assessed the genetic structure and detected selection signatures in two domestic populations of coho salmon by using three complementary statistical methods. First, we investigated basic population genetics statistics. Genetic diversity estimates, assessed by observed and expected heterozygosity, were in the upper limit of those reported in other salmon species introduced for farming purposes in Chile, such as Atlantic salmon, in which heterozygosity levels of these populations have been shown to range from 0.266 to 0.3721,22. In addition LD, which can be informative of population demography, decays similarly to what has been observed on previous studies in farmed coho salmon population from Chile32, and more rapidly than farmed Atlantic salmon52. Thus, the higher levels of genetic diversity and the lower levels of LD in farmed coho salmon populations suggest less impact of artificial selection on genome diversity, likely due to a shorter period of time of domestication in comparison with Atlantic salmon aquaculture populations from Chile. Genetic structure analyses revealed a high level of admixture in both populations. It is well known that demographic forces, such as admixture can mimic the patterns of genetic variation expected under selection53, 54. Therefore, we cannot discard the presence of false positives in the detection of selection signatures in this study, due to a high degree of mixed ancestry of these populations.

Several methods have been developed for detecting selection signatures in the genome. These methods can be grouped into three categories based on the type of genetic information exploited: population differentiation, site-frequency spectrum and linkage disequilibrium approaches18. In the present study we implemented three different tests (iHS, XP-EHH and CLR) to comprehensively identify candidate regions of positive selection in coho salmon. As is shown in Fig. 7, different genomic regions were detected by these methods. CLR have the largest overall number of signals detected to be underlying selection, followed by iHS and XP-EHH. No overlap was observed among the three methods, however CLR and iHS (Fig. 7), showed five overlapping regions on Oki5, Oki6 and Oki14 spanning a total of 2.02 Mb. Low overlap among methods based on allele frequencies compared to those based on haplotype patterns has been observed in previous studies in Atlantic salmon21,23. Discrepancies between loci under putative selection detected by different methods are expected, given that they benefit from different sources of information from genome variation. For instance, iHS test has advantage in detecting selective sweep with variants at intermediate frequencies11, while XP-EHH is more powerful at detecting complete or nearly complete selective sweeps11.

Figure 7
figure 7

Genomic distribution of selection signatures detected by iHS, XP-EHH, CLR and per populations on all chromosomes.

On the other hand, a low overlap between the two populations was observed (Fig. 7). Four chromosomes showed common regions harboring selection signatures: Oki5 (1.86 Mb); Oki14 (946 kb), Oki16 (41.5 kb) and Oki26 (57.3 kb). We suggest this low overlap among populations might be the effect of breeding program goals, which predominantly are focused on improving polygenic traits (e.g., growth), therefore, different genomics regions on different populations are affected by selection across the genome. The population-specific signals suggest that selection may have acted upon different genes, which has been already shown in other species and can be interpreted as a polygenic basis of parallel selection traits55. Several studies in fish have suggested that the same phenotype may arise from different genetic pathways among populations21,23,56,57,58,59. Evolution of complex parallel phenotypes can indeed emerge from different evolutionary processes and this is likely to happen when inbreeding and genetic drift play a greater role than selection21. On the other hand, it is worth mentioning that a low coverage of the genome using a SNP array might have caused lacking of information of some important genomic regions harboring selection signatures. This issue might be addressed with the use of whole genome sequencing, which could strongly enhance the accuracy of selection signatures detection by increasing the density of SNPs covering a greater area of the genome60,61.

One of the goals of this study was to identify putative candidates genes involved in the domestication process and artificial selection of coho salmon. In accordance with the primary genetic improvement goal, where growth rate has been the main focus, we found a series of genes potentially relevant for this trait, including anapc2; alad; bdkrb2; fam98b; chp2; myn that were previously associated with body weight in a genome-wide association study in Atlantic salmon (GWAS)62 and sytl5, txnrd1 genes associated with growth traits in juvenile farmed Atlantic salmon63. The chp2 gene was also found to be putatively under selection in an Atlantic salmon domestic population selected for fast growth22. We also identified the kdm6a gene, which has been involved in the regulation of anterior/posterior body axis development in zebrafish50 and has been related to growth-related traits in pigs64. Furthermore, we identified the biological processes “Developmental Process” and “Growth”, within the significant terms associated with the genes harbored in the identified genomic regions, suggesting that loci controlling growth and development are most likely under selection in these farmed coho salmon populations.

On the other hand, host–pathogen interactions lead to strong selection in the genome of vertebrate species65. We suggest that adaptation to farming environment has imposed selection on genomic regions related to the immune system in coho salmon. Here we found the cnpy2, synrg and med10 genes, which were previously shown to be affected by parasite-driven selection in Atlantic salmon66. More importantly, we detected the genes sec24d and robo1 that have been associated with disease resistance in the face of a bacterial disease (Piscirickettsia salmonis) in coho salmon populations farmed in Chile51.

Artificial selection may be applied either inadvertently (unconsciously) or intentionally (consciously)67. For most livestock species, it is thought that the early stages of domestication involved unconscious selection for behavioral traits (e.g., for tameness and reduced aggression), followed by selection focused on breeding objectives68. In fish, behavioral traits such as swimming capacities, foraging, social interactions, reproduction, or personality and cognitive abilities, are also modified by domestication69. In the present study we identified the genes robo1 and dcdc2 which are associated to complex cognitive acquired skills, including spoken and written language in human70. We also identified the gtf2ird1 associated to aggression and social interaction in mice71 and recently shown to be under selection in Nile tilapia25. We suggest that such genes could be associated with behavioral traits in coho salmon as well.

Sexual maturation and reproductive traits have profound importance from both evolutionary and economic perspective. In fish farming, maturation is frequently delayed by exposing fish to continuous light or light regimes, which are different to those, found in natural environment, affecting the perception of seasonality and circannual rhythms72. In this study, we identified the picalmb and tsku genes with evidence of selection. These genes were recently associated to maturation traits in Atlantic salmon73. Furthermore, we also detected the vgll4b, which has previously shown signs of selection in Atlantic salmon farmed populations22. It is important to mention that a gene from the VGLL (vgll3) family is associated with age to maturity in natural populations of Atlantic salmon74,75.

Conclusion

This study reported the identification of selection signatures in the genome of aquaculture lines of coho salmon. Our analyses of two domestic populations revealed putative selection signatures of genomic regions containing genes involved in growth, immune system, behavior and maturation traits. As expected, these findings are congruent with the breeding goal for the populations studied here, which includes the most highly selected trait, growth rate. In addition, the effect of domestication and adaptation to captive environment, most likely affecting loci controlling traits that have not been directly part of the breeding objectives, including immunity and, behavior. These results provide further insights into the genome diversity changes driven by domestication and selection mechanisms in farmed coho salmon populations.