CSYseq: The first Y-chromosome sequencing tool typing a large number of Y-SNPs and Y-STRs to unravel worldwide human population genetics

Sofie Claerhout; Paulien Verstraete; Liesbeth Warnez; Simon Vanpaemel; Maarten Larmuseau; Ronny Decorte

doi:10.1371/journal.pgen.1009758

Abstract

Male-specific Y-chromosome (chrY) polymorphisms are interesting components of the DNA for population genetics. While single nucleotide polymorphisms (Y-SNPs) indicate distant evolutionary ancestry, short tandem repeats (Y-STRs) are able to identify close familial kinships. Detailed chrY analysis thus provides both biogeographical background information and paternal lineage identification. The rapid advancement of high-throughput massive parallel sequencing (MPS) technology in the past decade has revolutionized genetic research. Using MPS, single-base information of both Y-SNPs and Y-STRs can be analyzed in a single assay typing multiple samples at once. In this study, we present the first extensive chrY-specific targeted resequencing panel, the ‘CSYseq’, which simultaneously identifies slow mutating Y-SNPs as evolution markers and rapid mutating Y-STRs as patrilineage markers. The panel was validated by paired-end sequencing of 130 males, distributed over 65 deep-rooted pedigrees covering 1,279 generations. The CSYseq successfully targets 15,611 Y-SNPs including 9,014 phylogenetic informative Y-SNPs to identify 1,443 human evolutionary Y-subhaplogroup lineages worldwide. In addition, the CSYseq properly targets 202 Y-STRs, including 81 slow, 68 moderate, 27 fast and 26 rapid mutating Y-STRs to individualize close paternal relatives. The targeted chrY markers combine a high average number of reads (Y-SNP = 717, Y-STR = 150) with easy interpretation, powerful discrimination capacity and chrY specificity. The CSYseq is interesting for research on different time scales: to identify evolutionary ancestry, to find distant family and to discriminate closely related males. Therefore, this panel serves as a unique tool valuable for a wide range of genetic-genealogical applications in interdisciplinary research within evolutionary, population, molecular, medical and forensic genetics.

Author summary

Around 95% of the male-specific Y-chromosome (chrY) is non-recombining and therefore inherited in a conserved manner from father to son. It can therefore serve as a powerful marker for interdisciplinary genetic-genealogical research as it provides a strong link between genetic information and a family tree or pedigree. While Y-chromosomal short tandem repeats (Y-STRs) discriminate close paternal kinships, single nucleotide polymorphisms (Y-SNPs) enable the identification of distant evolutionary ancestry. Unfortunately, an extensive chrY-specific sequencing panel combining a large number of familial Y-STRs and evolutionary Y-SNPs was not yet available. Therefore, chrY is rarely included in research projects and not often linked to a genealogical, history-demographical or life science database. As a result, the importance of chrY is not yet fully understood. Massive parallel sequencing (MPS) allows the simultaneous analysis at sequence level of Y-SNPs and Y-STRs with variable mutation rates in a large number of males. However, to date, no commercial kit exploits the full potential that MPS offers on chrY. Therefore, we developed the ‘CSYseq’, which is the first extensive chrY-specific sequencing panel. The CSYseq simultaneously identifies 9,014 slow mutating Y-SNPs to identify evolutionary ancestry, and 202 rapid mutating Y-STRs to investigate paternal relationships. We validated and optimized the panel through the analysis of 130 males distributed over 65 families. This novel MPS panel is useful for biogeographical identity and ancestry analysis, together with Y-chromosome profiling for the identification of patrilineages and discrimination of closely related males. As the CSYseq includes a very diverse set of markers that can be easily interpreted, it is interesting for different interdisciplinary applications within evolutionary, population, molecular, medical and forensic genetics.

Citation: Claerhout S, Verstraete P, Warnez L, Vanpaemel S, Larmuseau M, Decorte R (2021) CSYseq: The first Y-chromosome sequencing tool typing a large number of Y-SNPs and Y-STRs to unravel worldwide human population genetics. PLoS Genet 17(9): e1009758. https://doi.org/10.1371/journal.pgen.1009758

Editor: Takashi Gojobori, National Institute of Genetics, JAPAN

Received: November 19, 2020; Accepted: August 5, 2021; Published: September 7, 2021

Copyright: © 2021 Claerhout et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The PCR-CE Y-STR data supporting the findings of this study is available in the Y-STR Haplotype Reference Database (YHRD) at https://yhrd.org (accession numbers YA003651-53, YA003739-42 and YA004300-01). The raw sequence reads of the 816 amplicons targeted by the CSYseq cannot be shared publicly as this data contains potentially identifying participant information. Nevertheless, data will be made available upon request to the Ethics Committee Research UZ/KU Leuven (www.uzleuven.be/ethische-commissie/onderzoek).

Funding: This work was supported by the Catholic University of Leuven (KU Leuven, BOF-C1 grant number C12/15/013 - ML/RD and PDM/20/137 - SC). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

For a long time, male-specific Y-chromosome (chrY) polymorphisms have been widely investigated for their distant and close paternal lineage identification in various fields such as anthropology, evolutionary biology, population genetics, genetic-genealogy and forensic sciences [1–4]. As 95% of chrY does not recombine with chrX (NRY), it is inherited from father to son in a conserved manner. However, passing on the Y-chromosome over generations allows DNA variation to be accumulated during spermatogenesis. Genetic chrY variation on the NRY is caused by DNA modifications, such as replication slippage or base pair (bp) substitutions. Commonly typed chrY modifications are single nucleotide polymorphisms (Y-SNPs) and short tandem repeats (Y-STRs) [5,6].

Y-SNPs are slowly mutating bi-allelic markers (on average 10⁻⁸ to 10⁻⁹ mutations per generation, mpg) with a single-base variation useful for predicting human ancestry and origins as well as studying evolutionary migration patterns [5,7–9]. They enable the reconstruction of a well-preserved male phylogenetic tree divided into 20 main Y-haplogroups (from ‘A’ to ‘T’) and currently more than 9,000 Y-subhaplogroups [10]. Some Y-SNPs were identified more recently, which means that they can be attributed to a specific population or even a single family [6]. In 2014, Scozzari et al. sequenced approximately 1.5 Mb of the NRY using 68 unrelated males covering all major Y-haplogroups. They discovered eight private substitutions causing amino acid changes in protein-coding genes and approximately 1,900 novel Y-SNPs [11]. To date, more than 700,000 Y-SNPs have been detected according to the ISOGG YBrowse database (International Society of Genetic Genealogy human Y-chromosome Browser, ybrowse.org/gb2/gbrowse/chrY), and high-throughput analyzing techniques such as next generation sequencing (NGS) ensure that this number is continuously increasing. Due to the growing Y-SNP discovery rate, the entire phylogenetic tree becomes more complex. Therefore, Van Oven et al. constructed in 2014 a minimal version of the Y-tree which includes 417 branch-defining Y-SNPs. These Y-SNPs define the key phylogenetic positions and human evolutionary lineages around the world [10].

The other commonly typed DNA markers are the Y-STRs, which are fast mutating (10⁻⁴ to 10⁻² mpg) multi-allelic variations. The high degree of variability is caused by DNA strand slippage during replication leading to an increase or decrease of the number of tandem repeats [12,13]. As a difference in one locus is sufficient to distinguish two close relatives, it is interesting to genotype multiple rapidly mutating (RM) Y-STRs [14–17]. In 2010, Ballantyne et al. identified 13 RM Y-STRs with a 6.5-fold higher mutation rate. These RM Y-STRs individualize more than 99% of 12,272 unrelated males from 111 worldwide populations and introduce a higher degree of haplotype diversity on a global scale [14]. Among these, there are multi-copy Y-STRs located in the palindromic regions of chrY [16,18]. Since mutation probability is higher across these Y-STRs [19], the level of discrimination between close paternally related individuals can be enhanced.

Analyzing both Y-SNPs and Y-STRs is interesting for interdisciplinary genetic-genealogical research and human population genetics. An example of its purpose in investigative genetic-genealogy is the pioneering solved cold case of Marianne Vaatstra in The Netherlands [2]. In this case, slow mutating Y-SNPs were first genotyped in order to identify the Y-subhaplogroup and biogeographical origin of the perpetrator. This indicated that the murderer of Marianne was not an asylum seeker, as was assumed in the village, but someone from the local area. Second, faster mutating Y-STRs were used later to find relatives of the perpetrator through a mass screening of male volunteers from the neighborhood. Genotyping slow mutating Y-STRs increased the chance of finding a relative, whereas including rapidly mutating Y-STRs increased the discrimination power to distinguish two close relatives. For interdisciplinary genetic-genealogical research, including Y-SNPs is interesting because they could be important indicators for kinships (private and genealogical Y-SNPs), biogeographical origins and complex human traits [20]. The latter has already been confirmed in literature for complex human traits such as infertility, immune responses, cardiovascular risk and even COVID-19 mortality [21,22]. Complementarily, Y-STRs are valuable to decrease false positive kinships, to confirm close biological family, to study their recent common ancestor relatedness and to differentiate between related and non-related males [23]. In human population and evolutionary genetics, the combination of Y-SNPs and Y-STRs has enabled the analysis of haplogroup-specific Y-STR mutation rates [24], recent and past migration events [25], biogeographical genetic variation [26], extra-pair paternity [27], network analysis within populations [28] and even correlations with socio-cultural factors [29].

Until now, chrY genotyping was mainly based on fragment analysis for Y-STRs or a single-base extension (SBE) assay for Y-SNPs using capillary electrophoresis (CE). However, CE has limitations that can theoretically be overcome by high-throughput massive parallel sequencing (MPS) technology. First, Y-STRs of similar allele size but with a different sequence, called isoalleles, cannot be distinguished with CE. This results in unreported genetic variation between individuals or hidden parallel Y-STR mutations (PM) within genealogical pairs [30]. As MPS offers the ability to target and analyze DNA at sequence level, isoalleles can be distinguished, intra-repeat SNPs can be detected and new unique allelic variants of known STRs can easily be identified [31]. Second, due to the spatial and spectral resolution of CE, only a limited number of markers can be analyzed simultaneously, resulting in the need to develop different multiplexes [2,32]. Currently, the two most comprehensive commercial CE-kits for Y-STR DNA profiling are the PowerPlex Y23 (23 Y-STRs, Promega) and the Yfiler Plus PCR Amplification Kit (27 Y-STRs, Applied Biosystems) [33]. With MPS, a large number of markers (both Y-SNPs and Y-STRs) can be analyzed simultaneously, reaching a higher discrimination capacity and wider range of applications [32,34]. To date, several MPS panels are already commercialized for SNP identity and ancestry analysis as well as STR marker DNA profiling.

Thermo Fisher Scientific was the first company to develop commercial kits for second-generation sequencing, with the Ion Torrent HID STR 10-plex being the first kit for autosomal STR genotyping [35]. In 2015, the kit was upgraded to the Early Access STR Kit v1, which was able to detect 25 autosomal STR loci [36]. Both kits are compatible with their Ion PGM platform. Also in 2015, Illumina developed the first targeted NGS panel, called the ForenSeq DNA Signature Prep kit, which targets 58 STRs (including 27 autosomal STRs, 24 Y-STRs and 7 X-STRs) alongside 172 autosomal SNPs (94 identity SNPs, 56 ancestry SNPs and 22 phenotypic SNPs) [37]. For this kit, Illumina developed the MiSeq FGx System, which includes data analysis software [38] that provides investigators with additional genetic variation information [39]. Shortly after, Thermo Fisher Scientific developed the HID-Ion AmpliSeq Identity Panel to target 124 different SNPs (including 90 autosomal SNPs and 34 Y-SNPs), but no STRs. With this panel, the biogeographical ancestry can be determined through chrY analysis using the HID-Ion PGM system [40]. More recently, in 2019, Thermo Fisher Scientific commercialized the Ion AmpliSeq HID Y-SNP Research Panel v1, targeting 859 phylogenetic Y-SNPs with which 640 Y-haplogroups can be determined [41]. The latter kit contains the largest number of Y-SNPs so far, but there is still no MPS panel that targets both evolutionary Y-SNPs and familial Y-STRs.

MPS offers the combination of sequencing large numbers of samples and markers while providing single-base sequencing information. The present study focusses on the development of the first extensive chrY-specific MPS panel, called the ‘CSYseq’. This newly developed panel targets a large number of phylogenetic informative Y-SNPs and multiple Y-STRs in a single assay. All Y-polymorphisms included in the panel were analyzed and investigated on their ease of interpretation, depth of coverage, discrimination power, mutability and chrY specificity.

Results

To create our chrY-specific MPS panel, regions of interest containing known Y-STRs and reported Y-SNPs were carefully chosen based on literature (see Materials and Methods). We preselected 865 defined chrY regions (39,126 bp) containing 251 Y-STRs and 772 phylogenetic informative Y-SNPs to cover the entire Minimal reference Y-tree [10]. Primer pairs were designed using DesignStudio by Illumina to create the most optimal panel. They provided us with a list of amplicons and chrY positions that our panel would target in theory. In the results below, this theoretical version of the CSYseq is compared to the output of the CSYseq panel after sequencing: theory versus practice.

In theory, our custom-made panel developed with DesignStudio targets 857 fragments with an average length of 248 bp (range: 225–275 bp). This panel covers 209,248 bp distributed over the euchromatic chrY region and is able to genotype 228 known Y-STRs and 757 phylogenetic informative Y-SNPs. Not all our initially selected Y-STRs and Y-SNPs were included in the amplicon selection made by DesignStudio. This can be due to low primer specificity, to primer sets that could not be designed without placing Y-SNPs in the primer positions (1000 Genomes as variant source), or to a combination of both. The defined amplicon length of 250 bp might be the limiting factor in the selection of the two flanking primers for the assay. In total, 94% of our initial target region is covered by DesignStudio. According to Illumina, a custom design of a TruSeq kit results in at least 70% specificity and 80% coverage of the target regions. Yet, with the CSYseq we reached a coverage of more than 90%.

In practice, after sequencing 130 males, the number of paired-end reads per library was between 346,314 and 2,855,636 (average 818,160 reads). Of the 857 amplicons provided by DesignStudio, 28 amplicons (3.3%) were not sequenced or contained a low depth of coverage, 7 included no known Y-polymorphism, 13 were only partially sequenced (some Y-SNPs were typed, but not the entire Y-STR) and the remaining 809 amplicons provided full sequence reads. The amplicons not containing a known Y-polymorphism are probably a result of low primer design specificity, with the oligos binding on other chrY regions than intended by DesignStudio. Additionally, other genomic regions were targeted due to the sequence homology of several CSYseq primers. Some CSYseq primers aligned on duplicated Y-chromosomal positions or on other chromosomes. The number of aligned reads of all samples and the target chromosome distribution are sorted on alignment percentage with chrY and visualized in a heat map (S1 Fig). As expected, most reads aligned with chrY (67.1%, SD = 9.0%), followed by chrX (4.1%, SD = 0.3%) and chr2 (2.4%, SD = 1.1%). This homology does not affect the results of the CSYseq as our Y-SNP and Y-STR data analysis is sequence-specific and takes flanking and repeat regions into account to filter out homology. The total number of chrY aligned paired-end reads per sample was on average 400,924 reads (range: 67,609–2,573,191 reads). This was nearly twice the depth of coverage (average 250,000 reads) that was necessary to obtain at least 150 single-end reads per amplicon.

Y-SNPs as evolutionary markers

Y-SNPs are slowly mutating bi-allelic markers used to reconstruct a human phylogenetic tree, to predict ancestral origins and to study evolutionary migration patterns [5,7–9]. Based on the ISOGG YBrowse database (2019–2020), the 841 designed amplicons would target 13,812 known Y-SNPs (http://ybrowse.org/gb2/gbrowse/chrY). As reported by the ISOGG YBrowse database, they can be further divided into 5,927 Y-SNPs with a still unknown phylogenetic position and 7,885 haplogroup-specific Y-SNPs where 30 Y-SNPs have been identified as private SNPs. These haplogroup-specific Y-SNPs define 1,212 unique Y-subhaplogroups (covering 96% of the Minimal Y-tree) [10].

In practice, sequencing 130 males with our CSYseq panel and data analysis with Yleaf [42] successfully enabled the identification of 15,611 Y-SNPs. As reported by the ISOGG YBrowse database (2019–2020), they can be further divided into 6,597 Y-SNPs with a still unknown phylogenetic position and 9,014 evolutionary haplogroup-specific Y-SNPs where 32 Y-SNPs have been identified by ISOGG as private Y-SNPs. The haplogroup-specific Y-SNPs target 1,443 unique Y-subhaplogroups including all main haplogroups (from ‘A’ to ‘T’) divided across the entire phylogenetic tree (Table 1). The output of the panel covers 445 haplogroups (97%) of the 458 haplogroups included in the Minimal Y-tree. The 13 Y-subhaplogroups not covered by the panel are B1, C1b1a1a1a1a, I1a2a1a1d1a1a2b1a, K1a, K1b, K2a1a, K2b2, M3, N1a1a1a1a1a6, R1b1a1b1a1a2a1b1a, R1b1a1b1a1a2a2, R1b1a1b1a1a2b3b and S1a2. In total, 129 samples covered the 445 haplogroups and one sample targets only 403 Y-haplogroups. The latter sample was also observed to have the lowest output number of Y-SNPs (7,284) and the lowest chrY alignment percentage (13%). Even though this was a challenging sample, it is still able to target 88% of the haplogroups included in the Minimal Y-tree [10]. In Table 1, it can be observed that the CSYseq contains an equal subhaplogroup distribution per main haplogroup compared to the Minimal Y-tree [10]. A complete phylogenetic tree including all CSYseq typed Y-subhaplogroups can be found in S1 Table within the Supporting Information file.

Table 1. CSYseq Y-SNP and subhaplogroup coverage.

https://doi.org/10.1371/journal.pgen.1009758.t001

Targeted Y-SNPs contained between 10 and 5,218 reads per sample with an average of 717 reads. For the 65 non-related samples, the total number of reads per Y-SNP ranged from 10 to 339,189 with 70% between 10,000 and 100,000 reads (Fig 1A). On average, there are 12,281 Y-SNPs typed per sample and even the least extensive sample still contained 7,284 well-typed Y-SNPs (Fig 1B). The number of typed Y-SNPs per sample was significantly correlated with the total number of reads identified in the FASTQ files (p = 4.33×10⁻⁷) and the number of reads aligned against chrY (p = 8.85×10⁻⁴⁰) (Fig 1B). This was as expected, since the more reads the sample has in total (FASTQ) or aligned with chrY, the more reads it has per chrY amplicon containing the Y-polymorphisms of interest. Sample quality statistics revealed a slightly significant (at the margin of statistical significance) correlation between typed Y-SNPs and the initial chrY concentrations measured before library preparation (p = 1.30×10⁻³) as well as the degradation index (DI, p = 1.27×10⁻²) (see section ‘CSYseq robustness’, S2 Fig). When MPS output is compared to the limited Y-SNP panel typed by the SBE SNaPshot PCR-CE technique used in most laboratories, a deeper Y-SNP subhaplogroup was successfully genotyped for 66% of the samples due to the massive number of typed Y-SNPs (Fig 1C). On average, four additional phylogenetic branches were detected, with a maximum of ten branches: from ‘R1a1a’ (R-M198) with SNaPshot-CE to ‘R1a1a1b1a3a2b2b’ (R-AM00559) with MPS. In 32% of the samples, both techniques resulted in the same final derived Y-SNP, but for two samples with subhaplogroups ‘J-M92’ and ‘R-L2’, SNaPshot-CE surpassed MPS in typing one branch deeper. For these latter two cases, the CSYseq panel was able to sequence both final Y-SNPs, but the markers did not pass the selected sequencing criteria of at least 10 reads and a base calling percentage of 90%. This is sample specific and not a limitation of the panel. J-M92 was typed in 90 samples with an average depth of coverage of 43 reads, and R-L2 was typed in all the other samples with a high average depth of coverage of 1,095 reads.
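
The sequencing criteria mentioned above (at least 10 reads and a base calling percentage of 90%) can be illustrated with a minimal Python sketch. The function name `call_snp`, its parameters and the example read counts are hypothetical and serve illustration only; this is not the Yleaf implementation, and treating both thresholds as inclusive is an assumption.

```python
from collections import Counter

MIN_READS = 10            # minimum depth of coverage stated in the text
MIN_BASE_FRACTION = 0.90  # minimum base calling percentage stated in the text


def call_snp(base_counts, min_reads=MIN_READS, min_fraction=MIN_BASE_FRACTION):
    """Return the called base, or None if the position fails the criteria.

    base_counts maps a base ('A', 'C', 'G', 'T') to its read count at one
    Y-SNP position. Both thresholds are treated as inclusive (assumption).
    """
    depth = sum(base_counts.values())
    if depth < min_reads:
        return None  # too few reads to call this position
    base, count = Counter(base_counts).most_common(1)[0]
    if count / depth < min_fraction:
        return None  # the majority base is not dominant enough
    return base
```

Under these assumptions, a position covered by 95 reads of 'A' and 5 reads of 'G' would be called as 'A', whereas a position with only 8 reads, or with a 60/40 split, would be left uncalled.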

Fig 1. CSYseq targeted Y-SNPs.

A. The 15,611 genotyped Y-SNPs with number of paired-end reads for all samples (threshold = 10 reads). B. Correlation between typed Y-SNPs and the total reads per sample obtained from FASTQ files (▲) and chrY alignment (). C. Number of typed Y-SNP subhaplogroup branches through SNaPshot-CE (black) compared to MPS (grey) sorted by their main Y-haplogroup.

https://doi.org/10.1371/journal.pgen.1009758.g001

The female sample revealed output for 399 Y-SNPs, of which 205 Y-SNPs (51%) have an unknown phylogenetic position. Detailed Y-SNP analysis revealed no unambiguous haplogroup determination. Although the depth of coverage of these Y-SNPs was above the threshold of 10 single-end reads, they only had a median of 31 reads. Additionally, for all 399 Y-SNPs, the average number of reads obtained for the male samples was 13 times higher than for the female sample. This is valuable information for forensic genetics to set a suitable read threshold when DNA mixture analysis needs to be performed.

Y-STRs as patrilineage markers

Y-STRs provide a high degree of variability due to their fast mutating multi-allelic variations. This makes them highly interesting for population genetics. In theory, the 214 designed amplicons of our CSYseq panel target a total number of 228 Y-STR loci. In practice, after sequencing and Y-STR data analysis with FDSTools, 28 Y-STR loci were excluded from the panel due to not being sequenced (n = 20), low depth of coverage (n = 7) or strong chrX homology which could not be filtered out due to strong sequence similarities (n = 1). This was as expected and can be explained by the 90% primer design success rate of DesignStudio (see before). The excluded Y-STR loci with detailed information are listed in S2 Table. Through additional analysis of the high quality sequenced chrY reads with Tandem Repeat Finder (TRF) [43], two novel Y-STR loci were identified which are sequenced by the CSYseq panel. As no information about these specific Y-STRs is yet available in literature or within the ISOGG YBrowse database, they were named CSY1 and CSY2. Further, CSYseq analysis for the double sequenced male sample exposed equal data output and the female sample revealed no output. This indicates that our CSYseq panel is chrY-specific and possible output allele calls as a result of chrX homology were successfully filtered out using our in-house created ‘CSYseq.analYzer’ tool (see Materials and Methods).

In total, the CSYseq panel covers 202 well-targeted Y-STR loci. Table 2 provides detailed information concerning their repeat motif and discrimination capacity. HGVS nomenclature and Y-chromosome positions of these Y-markers can be found in S3 Table. The 202 Y-STRs from the CSYseq panel include 15 Y-STRs from the commercially available CE kits (PowerPlex Y23 and Yfiler Plus): DYS19, DYS389I/II, DYS390, DYS391, DYS392, DYS448, DYS456, DYS635, Y-GATA-H4, DYS533, DYS549, DYS570, DYS643 and DYS460. Furthermore, 17 Y-STRs targeted by the CSYseq are also present in the commercial kits developed for MPS (ForenSeq and PowerSeq): DYS19, DYS389I/II, DYS390, DYS391, DYS392, DYS448, DYS456, DYS460, DYS522, DYS533, DYS549, DYS570, DYS612, DYS635, DYS643 and Y-GATA-H4. As an internal control, 21 Y-STR loci sequenced by the CSYseq were compared to previously obtained PCR-CE results from our in-house YForGen kit (46 Y-STRs) and commercial Y-kits [44]. We observed that all Y-STR allele calls were in accordance with our previous results, which confirms that the results of MPS are reliable. A total of 188 Y-STRs are simple Y-STRs with one variable repeat motif, while 14 Y-STRs contain a more complex double repeat. For example, ‘DYS463’ consists of both AAAGG[n] and AAGGG[n] as variable repeat motifs, which were easily discriminated using FDSTools. Furthermore, 156 Y-STR loci are single-copy (SC) Y-markers, whereas the other 46 are multi-copy (MC) Y-markers, including three Y-STRs with four loci (-abcd). For most MC Y-STRs, it remained difficult to discriminate the different loci due to sequence similarities of the flanking and repeat regions. The results of the MC Y-STR loci with indistinguishable genome alignment were grouped together for further analysis. As in CE analysis, if the exact sequence per locus remains unknown, the Y-STR allele calls are sorted from short to long. This makes it possible to still perform Y-STR mutation analysis using the principle of parsimony: the least number of changes indicates the most likely event. DNA sequences of all included Y-STRs are publicly available in the ISOGG YBrowse database (http://ybrowse.org/gb2/gbrowse/chrY).
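
The short-to-long sorting of multi-copy allele calls described above can be sketched in Python as follows. The function name `multicopy_differences` and the example allele calls are hypothetical. Pairing the sorted copies minimises the summed repeat-number difference between two relatives; whether the parsimony criterion used in the study counts repeat steps or mutation events is not stated, so this choice is an assumption.

```python
def multicopy_differences(alleles_a, alleles_b):
    """Compare a multi-copy Y-STR between two relatives.

    Each argument lists the repeat numbers of all copies of the locus in one
    male, in arbitrary order, because the copies cannot be assigned to a
    specific locus. Copies are sorted short to long and paired by rank.
    Returns the non-zero repeat-number differences between paired copies.
    """
    if len(alleles_a) != len(alleles_b):
        raise ValueError("both relatives need the same number of copies")
    pairs = zip(sorted(alleles_a), sorted(alleles_b))
    return [b - a for a, b in pairs if a != b]
```

For instance, hypothetical calls of 21 and 19 repeats in one relative and 19 and 22 repeats in the other would be paired as 19/19 and 21/22, which yields a single one-step difference.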

Table 2. Detailed information concerning all 202 CSYseq targeted Y-STRs.

https://doi.org/10.1371/journal.pgen.1009758.t002

The CSYseq targets 57 di-, 38 tri-, 83 tetra-, 22 penta- and 2 hexanucleotide repeats. For autosomal DNA analysis, the two genuine allele calls of dinucleotide STRs can be difficult to interpret due to their stutter fragments. For Y-chromosomal STR analysis, this is different as it mostly results in one allele call due to its haploid nature. For the single-copy dinucleotide Y-STRs, stutter fragments and the true allele call can easily be identified using FDSTools. However, the panel does include six multi-copy dinucleotide Y-STRs whose copies cannot be distinguished by sequence variance in the flanking regions. These Y-STRs resulted in multiple stutter peaks. Therefore, a more complex separation of stutter alleles and genuine alleles from the different copies was necessary. The reported difficulties with these Y-STR stutters were taken into account within the CSYseq.analYzer file. This file sorts the sequences according to their number of reads to additionally filter out all stutters. An example of the different allele and stutter output combinations to interpret multi-copy dinucleotide YCAII-ab within our sample is provided in S4 Table.

The average number of reads per Y-STR locus was 150 reads, ranging from 5 (DYS448) to 619 reads (TRF17200) (Fig 2A). Only 11 Y-STR loci had an average number of reads below 10, though eight of them were genotyped for the majority of the samples. These markers may have insufficient coverage for challenging forensic samples, but this needs to be confirmed by future research. However, they can still be interesting to include into the panel for genetic-genealogy purposes and mass-screening for forensic familial searching. 137 Y-STRs are well-typed in more than 90% of the samples, 54 Y-STRs in 60 to 90% and 11 Y-STRs in 30–60% of the samples. The number of Y-STRs typed per sample is visualized in Fig 2B. On average, there are 184 Y-STRs typed per sample and even the least extensive Y-haplotype still contained 115 well-typed Y-STRs. There was no significant correlation between the number of typed Y-STRs and the number of reads within their FASTQ file (Fig 2C). However, the number of typed Y-STRs was observed to correlate significantly with the number of reads aligned against chrY (p = 6.90×10⁻³). Sample quality statistics revealed a slightly significant (at the margin of statistical significance) correlation between typed Y-STRs and the initial chrY concentrations measured before library preparation (p = 2.80×10⁻²), but not with the DI (see section ‘CSYseq robustness’, S2 Fig).

Fig 2. CSYseq targeted Y-STRs.

A. The average number of reads of the 202 targeted Y-STR loci of the CSYseq panel. B. The number of typed Y-STRs per sample. C. Correlation between the typed Y-STRs per sample and the total number of reads per sample obtained from FASTQ files (▲) and chrY alignment (×). D. A heatmap visualizing the allele ranges and frequencies per Y-STR.

https://doi.org/10.1371/journal.pgen.1009758.g002

All 202 Y-STR loci were investigated in detail using GenAlEx to determine the allele call frequencies with allele ranges (Fig 2D), discrimination capacity, and average repeat sizes (Table 2). The 14 Y-STR loci having multiple variable repeat units were divided into -M1 and -M2. The smallest variable repeat size contained only two repeats (DYS452-M2, DYS635-M2 and DYS19-M2), while the largest repeat number observed contained 30 repeats (DYS612). Detailed double repeat sequence variability with their allele call frequencies for the 14 variable complex and compound Y-STRs can be found in S5 Table within the Supporting Information file. Average discrimination capacity was 0.69 for complex and compound Y-STRs and 0.44 for simple Y-STRs. For 11 Y-STRs, no allele diversity was observed between the samples included in this study, so a discrimination capacity of zero was calculated.

Y-STR mutation analysis was conducted through Y-haplotype comparison between male relatives within the genealogical pairs and deep-rooting pedigrees. A detailed overview of the mutation statistics per Y-STR locus is given in Table 3. The number of generations covered per Y-STR locus averaged 1,083 meioses and ranged between 218 and 1,279 meioses. This variation arises because some Y-STR markers were not successfully typed in all samples. In total, 910 Y-STR differences were observed over 214,859 allele transfers (Table 3): 759 one-step, 98 two-step and 53 multi-step differences. For 66 Y-STRs, no allele call differences were observed within the sequenced genealogical pairs. The mutation rates of the other 136 Y-STRs are listed with their 95% confidence interval (CI) in Table 3 and visualized in Fig 3A. An overall average mutation rate of 4.57×10⁻³ mpg (95% CI: 4.29×10⁻³–4.86×10⁻³) was observed for the CSYseq panel. When the Y-STRs without an observed mutation in our study are excluded, the average mutation rate is 6.64×10⁻³ mpg, with a minimum of 4.15×10⁻⁴ mpg (DYS371-abcd) and a maximum of 4.13×10⁻² mpg (TRF14783). The mutating Y-STRs can be subdivided into 15 slow mutating Y-STRs (<10⁻³ mpg), 68 moderate mutating Y-STRs (≥10⁻³ to <5×10⁻³ mpg), 27 fast mutating Y-STRs (≥5×10⁻³ to <10⁻² mpg) and 26 rapid mutating Y-STRs (≥10⁻² mpg, Fig 3A, red line) [45]. The individual mutation rates of 101 Y-STRs were compared to literature [14,17,46–48]. In total, 95% of these Y-STRs were in accordance with literature; a significantly different mutation rate was observed for only five Y-STRs (DYS390, DYS490, DYS525, DYS606 and DYS612). Regarding the molecular factors influencing mutability, a significant positive correlation was observed between the individual Y-STR mutation rates and the average allele size (number of repeats; p = 8.34×10⁻¹⁰; Fig 3B). Mutation rates did not differ significantly between simple, compound and complex repeat Y-STRs (Fig 3C). Further, significant differences in mutation rate were identified between di-, tri-, tetra- and pentanucleotide Y-STRs, but neither a significant difference between tri- and tetranucleotide Y-STRs nor a linear correlation was observed (Fig 3D).
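
The mutation-rate classes used above can be illustrated with a minimal Python sketch. The class boundaries are those given in the text; the Wilson score interval is only an assumed choice of binomial confidence interval, since the CI method is not specified here, and the function names are hypothetical. The pooled estimate computed from the total counts need not equal the per-locus average reported above.

```python
import math

def wilson_interval(mutations, meioses, z=1.96):
    """Wilson score interval for a binomial proportion (assumed CI method)."""
    if meioses <= 0:
        raise ValueError("meioses must be positive")
    p = mutations / meioses
    denom = 1 + z**2 / meioses
    centre = (p + z**2 / (2 * meioses)) / denom
    half = z * math.sqrt(p * (1 - p) / meioses + z**2 / (4 * meioses**2)) / denom
    return centre - half, centre + half

def mutation_class(rate):
    """Assign a mutation rate (mutations per generation) to the classes in the text."""
    if rate < 1e-3:
        return "slow"
    if rate < 5e-3:
        return "moderate"
    if rate < 1e-2:
        return "fast"
    return "rapid"

# Pooled counts taken from the text: 910 differences over 214,859 allele transfers.
pooled = 910 / 214_859
low, high = wilson_interval(910, 214_859)
print(f"pooled rate {pooled:.2e} mpg ({low:.2e} to {high:.2e}): {mutation_class(pooled)}")
```

Per-locus rates would be obtained the same way, using the mutation and meiosis counts of each locus from Table 3.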

Fig 3. CSYseq Y-STR mutation analysis.

A. Individual Y-STR mutation rates with their 95% CIs. The RM Y-STR threshold (10⁻² mpg, dashed line) and average mutation rate (4.57×10⁻³ mpg, black line) are indicated. B. Significant positive correlation between the Y-STR mutation rate and the average number of repeats. C. The average Y-STR mutation rates of the different repeat motif types (simple, compound and complex). D. Average mutation rates per length of the repeat unit (bp).

https://doi.org/10.1371/journal.pgen.1009758.g003

Table 3. CSYseq Y-STR mutation analysis.

https://doi.org/10.1371/journal.pgen.1009758.t003

Detailed mutation analysis of complex and compound Y-STRs showed that Y-STR differences occurred more frequently within the longest repeat sequence. For example, in DYS725-abcd, which has the compound repeat structure GT[n]GTCT[n], the average numbers of repeats are 20 and 4, and the observed numbers of mutations per motif are 32 and 17, respectively. Furthermore, in five genealogical pairs, three markers (TRF10691, DYS463 and DYS725) showed allele call differences for both variable motifs at the same time. For three pairs, these differences were found on DYS725 (NC_000024.10:g.24738202): e.g., one relative had GT[19]GTCT[4] while the other had GT[20]GTCT[5], which reveals two independent one-step mutations in parallel. The other two genealogical pairs each contained a parallel mutation that would have remained hidden with CE, as the two mutations resulted in the same allele call: 22 repeats for DYS463 (NC_000024.10:g.7775468) with AAAGG[7]AAGGG[15] ↔ AAAGG[8]AAGGG[14] and 25 repeats for TRF10691 (NC_000024.10:g.15550131) with TG[21]N[10]TG[4] ↔ TG[20]N[10]TG[5].
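
Why sequence-based calls reveal changes that a length-based call hides can be shown with a short Python sketch. It parses the bracket notation used above, sums the repeat counts as a length-based (CE-like) call, and reports per-motif differences. The function names are hypothetical, and treating an `N[k]` block as a spacer that is excluded from the repeat count is an interpretation inferred from the 25-repeat call quoted for TRF10691.

```python
import re

TOKEN = re.compile(r"([ACGTN]+)\[(\d+)\]")

def parse_allele(allele):
    """Split bracket notation such as 'GT[19]GTCT[4]' into (motif, count) pairs."""
    blocks = [(motif, int(count)) for motif, count in TOKEN.findall(allele)]
    if "".join(f"{m}[{n}]" for m, n in blocks) != allele:
        raise ValueError(f"unrecognised allele string: {allele!r}")
    return blocks

def length_call(allele):
    """Total repeat count, as a length-based call; 'N[k]' spacers are skipped (assumption)."""
    return sum(n for m, n in parse_allele(allele) if m != "N")

def motif_steps(allele_a, allele_b):
    """Per-motif repeat differences (b minus a) for alleles sharing one motif structure."""
    pa, pb = parse_allele(allele_a), parse_allele(allele_b)
    if [m for m, _ in pa] != [m for m, _ in pb]:
        raise ValueError("alleles do not share the same motif structure")
    return [nb - na for (m, na), (_, nb) in zip(pa, pb) if m != "N"]

# Allele pairs quoted in the text.
pairs = {
    "DYS725": ("GT[19]GTCT[4]", "GT[20]GTCT[5]"),
    "DYS463": ("AAAGG[7]AAGGG[15]", "AAAGG[8]AAGGG[14]"),
    "TRF10691": ("TG[21]N[10]TG[4]", "TG[20]N[10]TG[5]"),
}
for locus, (a, b) in pairs.items():
    print(locus, length_call(a), length_call(b), motif_steps(a, b))
```

For DYS463 and TRF10691 the two length calls are identical while the per-motif steps are non-zero, which is the situation described above as hidden with CE.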

Comparison of the 202 Y-STR loci made it possible to distinguish all unrelated and related males, yielding 130 unique Y-haplotypes. Using the Y-STR differences observed over the 136 mutating loci, the CSYseq distinguished all paternally related males. On average, relatives were separated by 18 generations and discriminated by 13 Y-STR changes. A minimum of four Y-STR differences was observed for a pair separated by 18 meioses, whereas a maximum of 22 Y-STR changes was observed for two pairs separated by 21 and 29 meioses. No significant correlation was observed between the number of generations and the number of mutations. This can be explained by the inclusion of fast and rapid mutating Y-STRs in the CSYseq panel and by the occurrence of back and parallel mutations, which becomes more frequent as the generational distance within genealogical pairs increases [23].
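
The pairwise haplotype comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample names and allele calls are hypothetical, the function names are invented for the example, and loci that were not typed in both samples are simply skipped.

```python
from itertools import combinations

def differing_loci(hap_a, hap_b):
    """Loci typed in both haplotypes whose allele calls differ."""
    shared = hap_a.keys() & hap_b.keys()
    return sorted(locus for locus in shared if hap_a[locus] != hap_b[locus])

def all_distinguished(haplotypes):
    """True if every pair of samples differs at one or more shared loci."""
    return all(differing_loci(a, b) for a, b in combinations(haplotypes.values(), 2))

# Hypothetical allele calls for three samples (not data from this study).
haplotypes = {
    "relative_1": {"DYS725": "GT[19]GTCT[4]", "DYS612": "30", "DYS19": "14"},
    "relative_2": {"DYS725": "GT[20]GTCT[5]", "DYS612": "30", "DYS19": "14"},
    "unrelated": {"DYS725": "GT[20]GTCT[5]", "DYS612": "29"},
}

for (name_a, a), (name_b, b) in combinations(haplotypes.items(), 2):
    print(name_a, name_b, differing_loci(a, b))
print("all samples distinguished:", all_distinguished(haplotypes))
```

Counting the entries returned by `differing_loci` for a genealogical pair gives the number of Y-STR changes separating the two relatives.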

CSYseq robustness

A schematic overview of the MPS library quality and chrY data analysis steps is provided in S2A Fig. The TruSeq Custom Amplicon Low Input kit (Illumina, San Diego, CA, USA) recommends a DNA input of 10 ng and a DNA concentration of 2.5 ng/μl [49]. The chrY DNA input concentration of all sequenced samples, measured using PowerQuant qPCR, was between 1.75 and 17.58 ng/μl (average 5.69 ng/μl) with a degradation index (DI) from 0.90 to 4.79 (average 1.92; S2B Fig). No significant correlation between chrY concentration and DI could be observed. Five samples did not meet the recommended input concentration, of which two had a DI exceeding the manufacturer’s threshold of 2 [50]. The samples with the highest DI (4.79) and the lowest chrY concentration (1.75 ng/μl) are indicated by the labels ‘d1’ and ‘c1’, respectively, throughout S2 Fig.

Library preparation quality control with the 2100 BioAnalyzer indicated a library peak size between 357 and 397 bp for all samples (average 380 bp) and a library concentration between 0.01 and 8.82 ng/μl (average 1.6 ng/μl; S2C Fig). BioAnalyzer library concentrations showed a significant correlation with the initial chrY concentrations (p = 4.09×10⁻³), but not with the DI. Additionally, normalized KAPA SYBR qPCR Ct values (S2D Fig) correlated significantly with the chrY concentrations (p = 1.27×10⁻⁶), but only weakly with the DI (p = 0.006, R² = 0.062). As expected, normalized KAPA qPCR Ct values correlated significantly with the BioAnalyzer library concentrations (p = 1.24×10⁻¹⁰, R² = 0.293). Owing to library normalization, the number of FASTQ file reads per library after sequencing (S2E Fig) did not correlate significantly with either the initial chrY concentrations or the DI.

FASTQC software [51] flagged 63 samples as having high per sequence base quality, meaning that, for both paired-end reads, the lower quartile of the FASTQC Phred quality score did not drop below 20 within the first 150 bp. Across all samples, the read position at which FASTQC Phred scores dropped below 20 ranged from 85 to 278 bp (average 172 bp; S2F Fig). Again, only a slightly significant correlation was observed with the initial chrY concentrations (p = 2.54×10⁻³), and none with the DI. Likewise, the percentages of reads aligning to the GRCh37/hg19 reference genome and to chrY correlated significantly only with the input chrY concentrations (p = 4.33×10⁻³ and 6.16×10⁻⁴, respectively) and not with the DI. Remarkably, the two samples with the highest DI (4.79 and 4.69) differed in quality, one being high and the other low: their FASTQC Phred scores dropped below 20, the threshold defining low per sequence base quality, at 258 bp and 104 bp, respectively, and their chrY alignment was 61% and 36%. However, a high number of Y-markers (12,709 and 12,565 Y-SNPs; 186 and 192 Y-STRs) could still be sequenced (S2F and S2G Fig). In general, the number of Y-SNPs typed using Yleaf (S2G Fig) showed a slightly significant correlation with the input chrY concentrations (p = 1.30×10⁻³) and the DI (p = 1.27×10⁻²). The number of Y-STRs typed using FDSTools (S2H Fig), in contrast, was only slightly significantly correlated with the input chrY concentrations (p = 2.80×10⁻²) and not with the DI. Therefore, the success rate of the CSYseq panel was not clearly influenced by the initial chrY concentration or degradation index of a sample.
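
The base-quality criterion described above can be expressed as a small Python sketch. It assumes that the lower-quartile Phred score is available as one value per read position (not binned over several positions); the function names and the example profiles are hypothetical and only mimic the drop positions of 258 bp and 104 bp mentioned in the text.

```python
def first_position_below(lower_quartiles, threshold=20):
    """1-based read position where the lower-quartile Phred score first drops
    below the threshold, or None if it never does."""
    for position, quartile in enumerate(lower_quartiles, start=1):
        if quartile < threshold:
            return position
    return None

def high_quality(read1, read2, window=150, threshold=20):
    """True if, for both paired-end reads, the lower quartile stays at or above
    the threshold throughout the first `window` positions."""
    for quartiles in (read1, read2):
        position = first_position_below(quartiles, threshold)
        if position is not None and position <= window:
            return False
    return True

# Hypothetical lower-quartile profiles for 300 bp reads.
profile_late_drop = [36] * 257 + [18] * 43    # drops below 20 at position 258
profile_early_drop = [36] * 103 + [15] * 197  # drops below 20 at position 104

print(first_position_below(profile_late_drop), first_position_below(profile_early_drop))
print(high_quality(profile_late_drop, profile_late_drop))
print(high_quality(profile_late_drop, profile_early_drop))
```

Requiring both reads of a pair to pass mirrors the wording above, in which a sample is flagged as high quality only if neither read drops below the threshold within the first 150 bp.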

We now focus on sample d1, with the highest degradation index (4.79), and c1, with the lowest initial chrY concentration (1.75 ng/μl), both indicated throughout S2 Fig. Both showed low concentrations after library preparation as measured by the BioAnalyzer and KAPA qPCR (S2C and S2D Fig). Surprisingly, both yielded relatively high paired-end read outputs of 1,065,926 and 939,462 reads, respectively. The paired-end read alignment against the GRCh37/hg19 reference genome differed strongly, at 85% and 25%, respectively, and chrY alignment was 61% and 16%. Consequently, d1 showed an overall higher FASTQC Phred quality (dropping below 20 from 258 bp) than c1 (from 85 bp; S2E and S2F Fig). As a result, the number of typed Y-SNPs was slightly higher for d1 (12,706 Y-SNPs; 81%) than for c1 (11,139 Y-SNPs; 71%; S2G Fig). Remarkably, given the high number of typed Y-SNPs, both samples still yielded well-resolved deep Y-subhaplogroups. The CSYseq kit even added eight branches in c1, from ‘I2a1b1’ with CE to ‘I2a1b1a2b1a1a1’. Additionally, a high number of Y-STRs could still be typed for both d1 and c1: 186 Y-STRs (92%, average = 245 reads per Y-STR) and 182 Y-STRs (90%, average = 51 reads per Y-STR), respectively (S2H Fig).

Discussion

In this study, we present the ‘CSYseq’ panel, which allows the identification of 9,014 phylogenetic Y-SNPs and 202 Y-STRs through massive parallel sequencing (MPS). We sequenced one female sample and 130 males from the Low Countries (Belgium or the Netherlands) distributed over 65 different paternal pedigrees. This enabled us to evaluate all Y-polymorphisms included in the CSYseq panel for their ease of interpretation, depth of coverage, discrimination power, mutability and chrY specificity.