Article

Open-Access Worldwide Population STR Database Constructed Using High-Coverage Massively Parallel Sequencing Data Obtained from the 1000 Genomes Project

by Tamara Soledad Frontanilla 1,†, Guilherme Valle-Silva 2,†, Jesus Ayala 3 and Celso Teixeira Mendes-Junior 2,*
1 Departamento de Genética, Faculdade de Medicina de Ribeirão Preto, Universidade de São Paulo, Ribeirão Preto 14049-900, SP, Brazil
2 Departamento de Química, Laboratório de Pesquisas Forenses e Genômicas, Faculdade de Filosofia, Ciências e Letras de Ribeirão Preto, Universidade de São Paulo, Ribeirão Preto 14040-901, SP, Brazil
3 Facultad de Ingeniería Informática, Universidad de la Integración de las Americas, Asunción 00120-6, Paraguay
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Genes 2022, 13(12), 2205; https://doi.org/10.3390/genes13122205
Submission received: 15 October 2022 / Revised: 13 November 2022 / Accepted: 21 November 2022 / Published: 24 November 2022
(This article belongs to the Special Issue Advances in Forensic Molecular Genetics)

Abstract

Achieving accurate STR genotyping by using next-generation sequencing data has been challenging. To provide the forensic genetics community with a reliable open-access STR database, we conducted a comprehensive genotyping analysis of a set of STRs of broad forensic interest obtained from the 1000 Genomes populations. We analyzed 22 STR markers using files of the high-coverage dataset of Phase 3 of the 1000 Genomes Project. We used HipSTR to call genotypes from 2504 samples obtained from 26 populations. We were not able to detect the D21S11 marker. The Hardy-Weinberg equilibrium analysis coupled with a comprehensive analysis of allele frequencies revealed that HipSTR was not able to identify longer alleles, which resulted in heterozygote deficiency. Nevertheless, AMOVA, a clustering analysis performed with STRUCTURE, and a Principal Coordinates Analysis showed a clear-cut separation between the four major ancestries sampled by the 1000 Genomes Consortium. Except for larger Penta D and Penta E alleles, and two very small Penta D alleles (2.2 and 3.2) usually observed in African populations, our analyses revealed that allele frequencies and genotypes offered as an open-access database are consistent and reliable.

1. Introduction

Next-generation sequencing (NGS), also known as massively parallel or deep sequencing, is a technology that allows millions of DNA fragments to be sequenced in parallel. NGS can deal with several regions or targets simultaneously, enabling variation sites or mutations in the genome to be detected. This technology has allowed worldwide human genetic diversity to be studied for various purposes, including forensic human identification [1,2,3].
Advances in the genomics area have made it possible to use NGS techniques in a more accessible way, mostly because of lower costs. Currently, many researchers are performing whole-exome (WES) and even whole-genome (WGS) sequencing to estimate polygenic risk scores and probabilities of developing multifactorial diseases associated with various genetic regions at once, which would be a more laborious and costly issue if using traditional methodologies [1].
The 1000 Genomes Consortium is a worldwide collaboration that has produced an extensive catalog of human genetic variation. The consortium has sequenced whole genomes of 2504 individuals belonging to multiple populations derived from five population groups: African, East Asian, European, South Asian, and admixed American [4]. These data are freely available at the International Genome Sample Resource website (https://www.internationalgenome.org; accessed on 5 July 2021) and can be used to generate a variant call format (VCF) file by means of a set of specific command lines [5]. In 2015, during Phase 3 of the Project, the consortium analyzed the genomes of all the individuals by using a combination of low-coverage whole-genome sequencing (WGS), deep exome sequencing, and dense microarray genotyping. The consortium described worldwide patterns of genomic diversity on the basis of Single Nucleotide Polymorphisms (SNPs), indels, and structural variants (SVs), including deletions, insertions, duplications, inversions, and copy-number variants (CNVs), but it did not analyze or study short tandem repeat (STR) markers in depth [6].
STR markers are crucial in human identification. These markers have high polymorphism levels and are particularly useful for interpreting mixtures of biological samples. However, in addition to the issue of small-sized amplicons, genotyping STR markers by using NGS data is difficult because alignment and stutter errors are frequent [7]. Achieving accurate genotyping by employing NGS data has been challenging because these data have high sequencing error rates [8]. Gymrek et al. (2012) managed to obtain and to analyze STR markers from the dataset of the 1000 Genomes Project using lobSTR [9]. Given that high coverage is mandatory for reliable STR genotype calling to be achieved, a primary concern regarding that study was that the data obtained from the 1000 Genomes Project available for lobSTR were generated by employing shallow sequencing coverage (2x–6x), so the calling was potentially susceptible to errors [10].
To circumvent this coverage issue, the New York Genome Center (NYGC) recently re-sequenced the 2504 samples of the panel of Phase 3 of the 1000 Genomes Project with high (30x) coverage, and aligned the sequence data to GRCh38. These publicly available data could be used to call STR markers reliably [5,11].
NGS technology allows dozens of STR markers to be analyzed together with different classes of markers that provide complementary contributions to population genetics and human identification. For example, including SNPs used as predictors of ancestry and phenotypic characteristics into commercial kits that employ capillary electrophoresis is unfeasible, but they can be combined with STR markers in NGS assays [1]. The problem with the NGS technology is the large amount of data generated and the lack of bioinformatic tools to analyze it [1]. Some tools (e.g., lobSTR [9], STRait Razor [12], toaSTR [13], and HipSTR [10], among others) were developed to analyze STR markers by using NGS data. Each tool employs different algorithms and flanking regions to capture STR reads.
Haplotype inference and phasing for STRs (HipSTR) was developed for calling microsatellites specifically from WGS Illumina FASTQ files. HipSTR was designed to deal with genotyping errors and to obtain more robust STR genotypes. HipSTR accomplishes this by learning locus-specific PCR stutter models with the aid of an EM algorithm, by employing a specialized hidden Markov model to align reads to candidate alleles while accounting for STR artifacts, and by using phased SNP haplotypes to genotype and to phase STR markers. These features make HipSTR one of the most reliable tools for genotyping STRs from Illumina sequencing data [10,14].
In contrast to other tools, HipSTR can process hundreds of samples at once. It also allows the user to determine the set of STR markers that must be analyzed and the flanking regions that must be used to capture them. In fact, previous studies showed that HipSTR provides accurate genotype calling. HipSTR accuracy was tested by comparing WGS calls from 118 samples to capillary electrophoresis data, which resulted in 98.8% consistency [10,15]. Recently, we compared HipSTR with STRait Razor and toaSTR and found that the three tools present high allele calling accuracy (greater than 97%) [14]. Although data processing with HipSTR is more complex and requires bioinformatics knowledge and some nomenclature adjustments, this tool is currently the fastest and most appropriate to deal with larger datasets, including whole genomes [14].
In this investigation we conducted a comprehensive genotyping analysis of a set of STRs of broad forensic interest obtained from the 1000 Genomes populations, aiming to release a reliable open-access STR database that should contribute to future studies in the field of forensic genetics.

2. Materials and Methods

2.1. Genotype Calling

Genotypes were called from 2504 individuals belonging to 26 populations derived from five population groups analyzed by the 1000 Genomes Consortium, namely African (AFR), East Asian (EAS), European (EUR), South Asian (SAS), and admixed American (AMR) [4]. The NYGC re-sequenced the samples of Phase 3 of the 1000 Genomes Project in a high-coverage (30x) assay by applying the NovaSeq 6000 Sequencing System (Illumina, Inc.; San Diego, CA, USA) with a paired-end approach (2 × 150 bp). Then, the NYGC made the data freely available at https://www.internationalgenome.org/data-portal/data-collection/30x-grch38 (accessed on 10 July 2021).
We used CRAM files to obtain the STR genotypes with the aid of the HipSTR software [10]. We selected 22 autosomal microsatellites that are commonly used in forensic practice: CSF1PO, D1S1656, D2S441, D2S1338, D3S1358, D5S818, D7S820, D8S1179, D10S1248, D12S391, D13S317, D16S539, D18S51, D19S433, D21S11, D22S1045, FGA, Penta D, Penta E, TH01, TPOX, and vWA.
To genotype the 22 STR markers based on the human reference genome GRCh38, we ran the HipSTR algorithm for each individual. For this purpose, we used a BED file with the coordinates of each STR region of interest, which was available in the HipSTR repository [10] (https://hipstr-tool.github.io/HipSTR-tutorial/; accessed on 10 July 2021) as described elsewhere [14]. We applied the calling filter (15% stutter model) and a minimum of eight reads to obtain more reliable genotypes. According to a binomial distribution, this minimum number of reads ensures (p > 0.99) that a homozygous genotype is called because of lack of variability at a given locus and not because the second allele has not been sampled.
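The read-depth argument above can be checked with a short calculation. The sketch below is illustrative only: it assumes that reads sample the two alleles of a heterozygote independently and with equal probability (0.5 each) and it ignores stutter and alignment artifacts; the function name is hypothetical and not part of HipSTR.

```python
def prob_both_alleles_sampled(n_reads: int, allele_fraction: float = 0.5) -> float:
    """Probability that both alleles of a heterozygote are seen at least once.

    Assumes each read independently samples the first allele with
    probability `allele_fraction` (a simplifying assumption).
    """
    p_only_first = allele_fraction ** n_reads
    p_only_second = (1.0 - allele_fraction) ** n_reads
    return 1.0 - p_only_first - p_only_second


if __name__ == "__main__":
    # Compare a few read-depth thresholds around the minimum of eight reads.
    for n in (6, 7, 8, 10):
        print(n, round(prob_both_alleles_sampled(n), 4))
```

Under these assumptions, eight reads is the smallest depth at which the probability of sampling both alleles exceeds 0.99; with seven reads it is about 0.984.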
To perform genotype calling, we used the VCF output file produced by HipSTR and took three parameters into account: the reference allele of each marker, the period (i.e., the length of each STR repeat unit), and the base pair differences (GB) as compared to the reference allele. We adjusted the nomenclature for D19S433, Penta D, Penta E, and vWA by following the recommendations made by Valle-Silva et al. [14]: removal of two repeat units from all D19S433 and vWA alleles called by HipSTR, inclusion of one repeat unit into all Penta D alleles, and removal of two nucleotides from all Penta E alleles. By using IGV software 2.8.2 [16,17] and the HipSTR VizAln function [10], Valle-Silva et al. [14] demonstrated that such adjustments are necessary to prevent some base pairs from shifting in allele calling when compared to the nomenclature established by the ISFG [18].
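The conversion from HipSTR output to a repeat-based allele name can be sketched as follows. This is a minimal illustration, not the script used in the study: the function name, the locus keys, and the reference lengths in the usage example are hypothetical, and the reference length is assumed to be expressed in base pairs of the repeat region. Only the four per-locus adjustments come from the text above.

```python
# Per-locus corrections in base pairs, following the adjustments described
# above. D19S433 and vWA are tetranucleotide repeats; Penta D and Penta E
# are pentanucleotide repeats.
ADJUSTMENT_BP = {
    "D19S433": -2 * 4,  # remove two repeat units
    "vWA": -2 * 4,      # remove two repeat units
    "PentaD": +1 * 5,   # add one repeat unit
    "PentaE": -2,       # remove two nucleotides
}


def allele_name(locus: str, ref_len_bp: int, period: int, gb: int) -> str:
    """Convert a base-pair difference (GB) into an allele designation.

    ref_len_bp: length of the reference allele's repeat region (assumed input).
    period:     length of the repeat unit in base pairs.
    gb:         base-pair difference relative to the reference allele.
    """
    total_bp = ref_len_bp + gb + ADJUSTMENT_BP.get(locus, 0)
    full_repeats, extra_bases = divmod(total_bp, period)
    if extra_bases == 0:
        return str(full_repeats)
    return f"{full_repeats}.{extra_bases}"


if __name__ == "__main__":
    # Hypothetical reference lengths, chosen only to illustrate the arithmetic.
    print(allele_name("TH01", 28, 4, 11))    # 39 bp at period 4
    print(allele_name("PentaE", 27, 5, 10))  # 37 bp minus 2 nt at period 5
```

Incomplete repeats are written in the usual way, as the number of full repeats followed by the number of extra bases (for example, 39 bp at a period of 4 gives 9.3).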

2.2. Statistical Analysis

We calculated allele frequencies, the Hardy-Weinberg equilibrium, and forensic parameters {Match Probability (MP), Power of Discrimination (PD), Power of Exclusion (PE), and Polymorphism Information Content (PIC)} for each population sample or each population group using GenAlEx 6.5 [19] and STRAF 2.5.1 [20] software.
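For readers unfamiliar with these forensic parameters, the sketch below computes them for a single locus from a list of diploid genotypes using commonly cited textbook definitions: MP as the sum of squared genotype frequencies, PD = 1 − MP, PE = h²(1 − 2hH²) with h and H the observed heterozygote and homozygote proportions, and PIC = 1 − Σp² − ΣΣ2pᵢ²pⱼ². Whether GenAlEx and STRAF use exactly these estimators is an assumption here, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import prod


def forensic_parameters(genotypes):
    """Return MP, PD, PE, and PIC for one locus.

    genotypes: list of (allele_a, allele_b) tuples, one per individual.
    """
    n = len(genotypes)
    geno_counts = Counter(tuple(sorted(g)) for g in genotypes)
    allele_counts = Counter(a for g in genotypes for a in g)
    freqs = [c / (2 * n) for c in allele_counts.values()]

    mp = sum((c / n) ** 2 for c in geno_counts.values())
    hom = sum(c for g, c in geno_counts.items() if g[0] == g[1]) / n
    het = 1.0 - hom
    pe = het ** 2 * (1.0 - 2.0 * het * hom ** 2)
    pic = (1.0 - sum(p ** 2 for p in freqs)
           - sum(2 * p ** 2 * q ** 2 for p, q in combinations(freqs, 2)))
    return {"MP": mp, "PD": 1.0 - mp, "PE": pe, "PIC": pic}


def combined_mp(per_locus_mp):
    """Combined match probability, assuming independent loci."""
    return prod(per_locus_mp)


if __name__ == "__main__":
    toy = [(9, 9), (9, 10), (9, 10), (10, 10)]  # toy data, not from the study
    print(forensic_parameters(toy))
    print(combined_mp([0.1, 0.2]))
```

The combined MP is the product of the per-locus values, which is valid only if the loci are independent.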
We employed Principal Coordinates Analysis (PCoA) using GenAlEx [19], Analysis of Molecular Variance (AMOVA) using Arlequin [21], and clustering analyses using STRUCTURE 2.3.4 [22] to explore how genetic diversity is distributed across populations of different ethnic backgrounds. We performed STRUCTURE analysis for k ranging from 3 to 6 by applying the correlated allele frequencies model, 100,000 burn-in steps followed by 100,000 Markov Chain Monte Carlo iterations, in 100 independent runs. We selected the results from the runs with the largest “Estimated Ln Probability of Data” {LnP (D)} and depicted them in bar plots created with Distruct 1.1 [23].
We also compared the allele frequencies estimated from the 1000 Genomes Project dataset with STR data retrieved from the same five major population groups (African, European, East Asian, South Asian, and admixed American) that compose the SPSmart STR browser (PopSTR) [24]. For this purpose, we employed Arlequin software to compare the allele frequencies of each STR marker for a given population group between the two datasets by using FST and an exact test of population differentiation based on genotype frequencies [21]. We made this comparison to verify the reliability of genotype data generated by HipSTR.

3. Results

The STR genotypes defined for each individual from the newest dataset released by the 1000 Genomes Project are available in Supplementary Table S1 as an open-access database. We excluded the D21S11 marker because we did not succeed in genotyping it (see Discussion). Apart from this marker, the mean coverage for calling genotypes ranged from 37.14 (TPOX) to 52.53 (D12S391) (Table 1). The average successful calling rate was 98.59%; this rate ranged from 84.18% (Penta E) to 100% (CSF1PO, D2S441, D2S1338, D3S1358, D5S818, D8S1179, D22S1045, and TPOX) (Table 2).
Table 2 lists the allele frequencies and forensic parameters estimated for the whole dataset. The allele frequencies and forensic parameters estimated for each of the 26 populations (Supplementary Table S2) and the five population groups (Supplementary Table S3) are available as Supplementary Data. In general, the most polymorphic loci in all the populations were D1S1656, D2S1338, D12S391, D18S51, and FGA (Table 2). The analyzed loci were highly informative, with elevated PD values ranging between 86.59% (TPOX) and 97.76% (D1S1656). The combined MP was 5.72 × 10−27, and the combined PE was 0.99999997. Analysis of each locus in each population (Supplementary Table S2) showed that D22S1045 in PEL (71.61%) and D1S1656 in GBR (97.52%) presented the lowest and the highest PD value, respectively. The combined MP ranged from 1.98 × 10−25 in ACB to 2.20 × 10−21 in PEL.
We estimated the adherences of genotype frequencies to Hardy-Weinberg Equilibrium expectations for each STR marker at a population level (Table 3). Penta E presented heterozygote deficiency in 24 out of the 26 populations, leading to departures from the Hardy-Weinberg equilibrium. This finding indicated that HipSTR incorrectly called many heterozygous genotypes as homozygous. Disregarding Penta E, the number of deviations ranged from one (D13S317 and D16S539) to five (D19S433 and Penta D), and the number of deviations across populations ranged from zero (ASW and CEU) to seven (PUR), with an average of 2.42 departures in each population. When we considered the Bonferroni correction for multiple tests, only 39 departures remained significant, and most of them (61.53%) concerned Penta E.
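The link between undetected long alleles and heterozygote deficiency can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not a description of how HipSTR behaves internally: it supposes that any allele longer than a cutoff is simply not observed, so a heterozygote carrying one long allele appears homozygous for the shorter allele and a genotype with two long alleles is missing. The function names, the cutoff, and the genotypes are hypothetical.

```python
from collections import Counter


def heterozygosity(genotypes):
    """Return observed heterozygosity, expected heterozygosity, and F = 1 - Ho/He."""
    n = len(genotypes)
    allele_counts = Counter(a for g in genotypes for a in g)
    freqs = [c / (2 * n) for c in allele_counts.values()]
    h_obs = sum(1 for a, b in genotypes if a != b) / n
    h_exp = 1.0 - sum(p * p for p in freqs)
    f = 1.0 - h_obs / h_exp if h_exp > 0 else 0.0
    return h_obs, h_exp, f


def drop_long_alleles(genotypes, max_detectable):
    """Toy dropout model: alleles above max_detectable are never observed."""
    called = []
    for a, b in genotypes:
        seen = [x for x in (a, b) if x <= max_detectable]
        if len(seen) == 2:
            called.append((a, b))
        elif len(seen) == 1:
            called.append((seen[0], seen[0]))  # apparent homozygote
        # if neither allele is detectable, the genotype is missing
    return called


if __name__ == "__main__":
    true_genotypes = [(7, 12), (12, 20), (7, 20), (12, 12), (7, 7), (20, 21)]
    called = drop_long_alleles(true_genotypes, max_detectable=18)
    print(heterozygosity(true_genotypes))
    print(heterozygosity(called))
```

In this toy example, the called genotypes show lower observed than expected heterozygosity (a positive F), which is the signature of the heterozygote deficit described for Penta E.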
Principal Coordinates Analysis (PCoA) revealed four different population clusters (Figure 1). The first coordinate separated the cluster of African (AFR) populations on the right side. On the left side, we observed three different population groups: the European (EUR) populations in the upper part, the East Asian (EAS) populations in the lower section, and the South Asian (SAS) populations between them. The CLM, MXL, PEL, and PUR admixed populations clustered with the European populations, while the ACB and ASW populations clustered with the African (AFR) populations, reflecting their ancestry compositions.
We obtained similar results when we conducted the STRUCTURE analysis. Figure 2 depicts the STRUCTURE results derived from runs obtained with k ranging from three to six. When k = 4, each cluster reflected one of the major ancestries of the 1000 Genomes Project. Moreover, each of the six admixed American populations presented varying levels of ancestries from the four biogeographical groups. To verify the distribution of variance in different levels, we performed AMOVA by assuming a hierarchical structure that gathered the populations in four population groups: AFR, EAS, EUR, and SAS. We did not take the six populations in the AMR population group into account because their admixed compositions would bias the AMOVA results by reducing the proportion of variance between groups. We observed most of the variance within populations (97.12%). Differences between the four population groups accounted for 2.54% of the variance, whereas only 0.34% of the variance occurred due to differences between populations belonging to the same group.
By using FST, we also compared the allele frequencies estimated from the dataset of the 1000 Genomes Project to the STR data retrieved for the same five major population groups (African, European, East Asian, South Asian, and admixed American) that composed the SPSmart STR browser (PopSTR) [24] (Table 4). While the AMR (four), EAS (three), EUR (eight), and SAS (four) population groups presented small numbers of markers with significantly different frequencies between the two datasets, AFR presented 17 significant differences. This pattern might reflect the set of populations that compose the compared groups. Penta E was the only marker that showed significantly different FST values in all comparisons. By leaving AFR and Penta E aside, we observed only 15 significant differences out of 80 comparisons: the mean number of statistically significant differences was 0.75 per marker; this number ranged from zero (eight STR markers) to three (D2S441). When we considered the Bonferroni correction for multiple tests, only three of these 15 FST values remained significant, while six out of 16 significant differences observed for AFR (leaving Penta E aside), and all five Penta E differences remained significantly different.

4. Discussion

The present study provides the most diverse database of forensic autosomal STR markers obtained from global populations. STR markers display high levels of polymorphism, which makes them attractive for forensic purposes and population genetics studies. This is the first time that the 1000 Genomes high-coverage (~30x) dataset has been used for STR genotyping purposes. Although a few previous initiatives [9,25,26] attempted to genotype forensically relevant STRs, they only dealt with previous low-coverage 1000 Genomes releases (~7.4x), which prevented the acquisition of results or resulted in highly unreliable genotypes due to large rates of allele dropout. Moreover, it should be emphasized that even the last paper that presented the high-coverage WGS data did not include STR variants in the results and stated that genotyping STRs from such data remains a considerable challenge [27].
In forensic genetics, STR markers are the most widespread and informative tool for human identification. In spite of the limitations addressed below, such as the unreliability of Penta D and Penta E genotypes involving specific alleles, this NGS-based STR database presents reliable allele frequencies that could be used in criminal casework to estimate the rarity of a given STR-based profile from a query sample of unknown or uncertain ancestry in various worldwide populations. This could instantly, and without additional costs, trigger a DNA-based intelligence strategy to guide enquiries [28], providing hints and/or assigning biogeographical origin in many situations, such as a missing person investigation [28,29], and leaving only the most complex cases for supplementary analysis with a more suitable set of Ancestry Informative Markers.
Short-read next-generation sequencing is slowly being introduced in forensic labs worldwide. Although such technology is still restricted and expensive, it has become more sensitive, requiring as little as 25 pg of extracted DNA, and is suitable to solve more complex cases, such as discrimination of twins (using STRs, WGS or mtDNA sequencing approaches) and deconvolution of highly unbalanced mixtures (reviewed in [30]). Some criminal [31,32,33], kinship [34] and missing persons [35] casework already benefiting from this technology has been reported. However, genotyping STR markers by using NGS data, especially WGS assays, may be challenging: accurate genotyping requires high coverage, longer alleles are difficult to detect due to reads of limited sizes, and mutations in flanking regions may lead to null alleles [36]. These and other issues have been addressed by van der Gaag et al. [37] and Valle-Silva et al. [14].
Notwithstanding the challenges addressed here, several studies have demonstrated that STRs can be genotyped by using dedicated bioinformatics tools. Software such as lobSTR [9], toaSTR [13], STRait Razor [12], and HipSTR [10], among others, have shown consistent and accurate results [14,15]. Moreover, Bornman et al. [8] demonstrated that, by using an NGS approach, CODIS loci could be accurately called even from mixtures.
Particularly for the deconvolution of mixtures, the identification of isometric alleles (i.e., alleles with the same length but containing different repeat sequences) is a necessary task, since it further increases the discriminating power of the currently used STR markers; nevertheless, it is not achieved with traditional PCR and capillary electrophoresis techniques [2,3]. This sequence-based analysis is already feasible with small-scale targeted sequencing assays, particularly those using kits and software solutions tailored for forensic purposes, such as the ForenSeq DNA Signature Prep Kit coupled with the ForenSeq™ Universal Analysis Software (Verogen Inc., San Diego, CA, USA) [38] or the Precision ID GlobalFiler™ NGS STR Panel v2 coupled with the Converge Software NGS Analysis Module (Thermo Fisher Scientific) [39], but it is still a challenge for large-scale WGS assays. In order to achieve this goal concerning big data in the near future, new bioinformatics tools must be developed, or the current ones further improved.
Willems et al. [26] analyzed human STR variation by using lobSTR. These authors employed the data of Phase 1 of the 1000 Genomes Project. The data were generated by using low-sequencing coverage, which is excessively error-prone. In fact, the authors reported difficulties in detecting both alleles in each sample, which resulted in an overall deficit of heterozygotes. As previously addressed, several reasons led us to choose HipSTR to call STR genotypes from this high-coverage dataset of the 1000 Genomes Project. Because HipSTR allows the flanking regions to be customized, almost any STR marker can be evaluated in hundreds of samples at once. At first glance, HipSTR may appear more complex, but it is the most appropriate tool to deal with whole genomes. In addition, a recent evaluation of the performance of this tool revealed high efficiency and accuracy levels [14].
Although HipSTR provides flexibility, the major limitation of this study is the inability to genotype D21S11, which is one of the 20 CODIS loci. Additional limitations are the failure in detecting two very small Penta D alleles and the biased allele frequencies of very large Penta D and Penta E alleles probably because of sequence-specific features, such as the GC content [40,41,42] producing low depth of coverage bias and/or the limited length of the Illumina NGS reads (150 bp paired-end reads). This issue could be immediately circumvented with long-read sequencing technologies, such as those implemented in Pacific Biosciences (PacBio) and Oxford Nanopore platforms. However, one should not expect that long-read sequencing would be suitable for a wide range of forensic samples, which are often degraded and/or available in low amounts [40,41,42,43]. It is noteworthy that, by employing 300 nucleotide-long paired-end reads in a targeted sequencing assay, we successfully genotyped D21S11 with HipSTR, which suggests a sequencing methodology issue rather than a bioinformatics issue [14].
In this study, Penta D and Penta E showed 10.74% and 15.81% of missing data, respectively. By using Illumina sequencing technology, van der Gaag et al. [37] showed that longer alleles of Penta D, Penta E, and FGA presented sequencing errors at the end of the reads, which resulted in null alleles and genotyping errors. As observed for D21S11, this issue was probably related to the impossibility of detecting longer alleles due to read-length constraints. Furthermore, and unexpectedly, we did not detect two very small Penta D alleles (2.2 and 3.2) that are common in African populations. Supplementary Table S4 compares the allele frequencies estimated in the present study with the allele frequencies obtained from the SPSmart STR browser (PopSTR) [24] for the major population groups. Such straightforward comparison showed that we were not able to detect alleles larger than 18 in Penta E. This failure led directly to Hardy-Weinberg equilibrium deviations (Table S3) due to a deficit of heterozygotes in 24 out of the 26 studied populations. Thus, allele frequencies estimated for Penta E were strongly biased toward increased frequencies of shorter alleles and have limited applicability (Supplementary Table S2). The probabilities obtained with the FST analysis (Table 4) supported this conclusion: Penta E presented significant FST values in all five comparisons. Although Penta D and FGA also posed this problem, their undetected alleles usually have low frequencies, except for Penta D alleles 2.2 and 3.2 in African populations (Supplementary Table S4). Therefore, this technical issue did not influence the Hardy-Weinberg equilibrium and FST analyses of these markers as much as it did for Penta E. Although this comparison is valid and helpful, we must emphasize that the compared samples corresponded to distinct population groups.
The African population group in PopSTR comprised mainly East African Somalian individuals (404 out of 507 samples), while the African populations in the 1000 Genomes Project samples corresponded to West Africa. Similarly, over 50% of the European population group in PopSTR consisted of U.S. Europeans (1443 out of 2135 samples) [5,24]. Taken together, these results attest that the bioinformatics analysis performed in the present study is robust, and that the distribution of allele frequencies is reliable for all loci except Penta E.
The most polymorphic loci in the whole dataset of the 1000 Genomes Project were D1S1656, D2S1338, D12S391, D18S51, and FGA. All these markers presented high degrees of polymorphism throughout the world. AMOVA revealed that most of the variance (97.12%) in allele frequencies occurred within populations, corroborating previous studies [44,45]. A study that evaluated human population structure using genotypes at 377 autosomal microsatellite loci in 1056 individuals from 52 worldwide populations revealed that the variance within populations accounts for 93 to 95% of genetic variation, while differences among major groups constitute only 3 to 5% [45,46]. Although the numbers of populations and genetic markers are quite different, the larger amount of variance within populations and lower variance among groups observed in the present study may be either due to chance or to the fact that forensic STRs do show relatively lower FST than random STRs due to the increased heterozygosity of the former [46]. However, as expected, AMOVA, together with the Principal Coordinates Analysis (Figure 1) and the clustering analysis performed with STRUCTURE (Figure 2), confirmed that the four ancestral population groups (AFR, EUR, EAS, and SAS) defined by the 1000 Genomes Consortium did differ significantly from each other. Given that the admixed American populations present different ancestry compositions (Figure 2), most of them clustered with Europeans, while ACB and ASW clustered with Africans (Figure 1).
The results obtained with the STRUCTURE software corroborated the relationship between the different population groups and provided additional support for the reliability of the calculated genotypes. When k = 3, SAS resembled an admixture between EAS and EUR. A specific cluster for SAS emerged when k = 4. When k = 5, a minor Eurasian (shared between EUR and EAS) component arose. When k = 6, the SAS-shared ancestry with EUR and EAS became more evident. Regarding the admixed American populations, irrespective of the number of clusters considered, ACB and ASW revealed their predominant African origin, CLM and PUR revealed more extensive European ancestry, and MXL and PEL revealed almost equal amounts of European and Amerindian (i.e., EAS) ancestries. These results fully corroborated the distribution of the populations in the PCoA (Figure 1). Additional clusters did not provide increased resolution with straightforward meaning.
The outcome of this population genetics evaluation further corroborates the robustness and reliability of this STR dataset. Despite all the applications already addressed in the beginning of this section, the most important contribution of this open access genotype dataset probably lies in the fact that it may be used to estimate and establish additional population genetics parameters that may be taken as direct references in many studies that are using the 1000 Genomes Project dataset to retrieve new sets of SNPs, indels and microhaplotypes in various efforts to maximize intelligence from DNA evidence [27,47,48,49,50].

5. Conclusions

We were able to offer a reliable open-access STR database based on the high-coverage (30x) WGS data of Phase 3 of the 1000 Genomes Project generated by the NYGC. However, the limited length of sequencing reads introduces noticeable bias in allele frequencies estimated for Penta D and Penta E. The reliability of this dataset is supported by (a) previous studies attesting that HipSTR is efficient, (b) the Hardy-Weinberg equilibrium analysis, (c) the set of analyses employed to evaluate the interpopulation genetic diversity, and (d) the comparison between the allele frequencies obtained here and the frequencies obtained by other initiatives that used capillary electrophoresis. Although we expect that this open-access database will be of great interest for future forensic studies on population genetics, the current 1000 Genomes Project dataset does not describe human genetic diversity worldwide. In fact, many biogeographical regions, mainly in Oceania and the Americas, have not been sampled, indicating that additional large-scale initiatives may provide further insight into STR diversity in populations worldwide.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/genes13122205/s1. Table S1: ST1-STR genotypes defined using HipSTR from the high-coverage New York Genome Center (NYGC) dataset released by the 1000 Genomes Project. Table S2: Summary of information necessary to understand the ST2 set of tables. Table S3: Summary of information necessary to understand the ST3 set of tables. Table S4: Summary of information necessary to understand the ST4 set of tables.

Author Contributions

T.S.F. and G.V.-S. contributed equally to this work; conceptualization, investigation, methodology, analysis, and writing of the paper. J.A. supported and helped with bioinformatic tools (software), and C.T.M.-J. with writing—reviewing, editing, and supervising. All authors have read and agreed to the published version of the manuscript.

Funding

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior-Brasil (CAPES)-Finance Code 001. C.T.M.-J. (#312802/2018-8) is supported by a Research fellowship from CNPq/Brazil.

Institutional Review Board Statement

Ethical review and approval were waived for this study because all data were derived from the 1000 Genomes public database.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data analyzed in this study are the 1000 Genomes Project Phase 3 samples sequenced at high coverage (30x) on the NovaSeq 6000 Sequencing System (Illumina, Inc.), available at https://www.internationalgenome.org/data-portal/data-collection/30x-grch38 (accessed on 10 July 2021).

Acknowledgments

We thank Thomas Willems for sharing his knowledge and expertise in bioinformatics and population genetics and his support with the HipSTR tool, and Cynthia Maria de Campos Prado Manso for her assistance with the English language.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Børsting, C.; Morling, N. Next generation sequencing and its applications in forensic genetics. Forensic Sci. Int. Genet. 2015, 18, 78–89. [Google Scholar] [CrossRef] [PubMed]
  2. Alvarez-Cubero, M.J.; Saiz, M.; Martínez-García, B.; Sayalero, S.M.; Entrala, C.; Lorente, J.A.; Martinez-Gonzalez, L.J. Next generation sequencing: An application in forensic sciences? Ann. Hum. Biol. 2017, 44, 581–592. [Google Scholar] [CrossRef] [PubMed]
  3. Ballard, D.; Winkler-Galicki, J.; Wesoły, J. Massive parallel sequencing in forensics: Advantages, issues, technicalities, and prospects. Int. J. Leg. Med. 2020, 134, 1291–1303. [Google Scholar] [CrossRef]
  4. Auton, A.; Brooks, L.D.; Durbin, R.M.; Garrison, E.P.; Kang, H.M.; Korbel, J.O.; Marchini, J.L.; McCarthy, S.; McVean, G.A.; Abecasis, G.R.; et al. A global reference for human genetic variation. Nature 2015, 526, 68–74. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  5. Clarke, L.; Fairley, S.; Zheng-Bradley, X.; Streeter, I.; Perry, E.; Lowy, E.; Tassé, A.M.; Flicek, P. The international Genome sample resource (IGSR): A worldwide collection of genome variation incorporating the 1000 Genomes Project data. Nucleic Acids Res. 2017, 45, D854–D859. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  6. Sudmant, P.H.; Rausch, T.; Gardner, E.J.; Handsaker, R.E.; Abyzov, A.; Huddleston, J.; Zhang, Y.; Ye, K.; Jun, G.; Fritz, M.H.; et al. An integrated map of structural variation in 2,504 human genomes. Nature 2015, 526, 75–81. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  7. Fungtammasan, A.; Ananda, G.; Hile, S.E.; Su, M.S.; Sun, C.; Harris, R.; Medvedev, P.; Eckert, K.; Makova, K.D. Accurate typing of short tandem repeats from genome-wide sequencing data and its applications. Genome Res. 2015, 25, 736–749. [Google Scholar] [CrossRef] [Green Version]
  8. Bornman, D.M.; Hester, M.E.; Schuetter, J.M.; Kasoji, M.D.; Minard-Smith, A.; Barden, C.A.; Nelson, S.C.; Godbold, G.D.; Baker, C.H.; Yang, B.; et al. Short-read, high-throughput sequencing technology for STR genotyping. Biotech. Rapid Dispatches 2012, 2012, 1–6. [Google Scholar] [CrossRef]
  9. Gymrek, M.; Golan, D.; Rosset, S.; Erlich, Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Res. 2012, 22, 1154–1162. [Google Scholar] [CrossRef] [Green Version]
  10. Willems, T.; Zielinski, D.; Yuan, J.; Gordon, A.; Gymrek, M.; Erlich, Y. Genome-wide profiling of heritable and de novo STR variations. Nat. Methods 2017, 14, 590–592. [Google Scholar] [CrossRef]
  11. Fairley, S.; Lowy-Gallego, E.; Perry, E.; Flicek, P. The International Genome Sample Resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Res. 2020, 48, D941–D947. [Google Scholar] [CrossRef] [PubMed]
  12. Warshauer, D.H.; Lin, D.; Hari, K.; Jain, R.; Davis, C.; Larue, B.; King, J.L.; Budowle, B. STRait Razor: A length-based forensic STR allele-calling tool for use with second generation sequencing data. Forensic Sci. Int. Genet. 2013, 7, 409–417. [Google Scholar] [CrossRef] [PubMed]
  13. Ganschow, S.; Silvery, J.; Kalinowski, J.; Tiemann, C. toaSTR: A web application for forensic STR genotyping by massively parallel sequencing. Forensic Sci. Int. Genet. 2018, 37, 21–28. [Google Scholar] [CrossRef] [PubMed]
  14. Valle-Silva, G.; Frontanilla, T.S.; Ayala, J.; Donadi, E.A.; Simões, A.L.; Castelli, E.C.; Mendes-Junior, C.T. Analysis and comparison of the STR genotypes called with HipSTR, STRait Razor and toaSTR by using next generation sequencing data in a Brazilian population sample. Forensic Sci. Int. Genet. 2022, 58, 102676. [Google Scholar] [CrossRef]
  15. Halman, A.; Oshlack, A. Accuracy of short tandem repeats genotyping tools in whole exome sequencing data. F1000Res 2020, 9, 200. [Google Scholar] [CrossRef] [Green Version]
  16. Thorvaldsdóttir, H.; Robinson, J.T.; Mesirov, J.P. Integrative Genomics Viewer (IGV): High-performance genomics data visualization and exploration. Brief. Bioinform. 2013, 14, 178–192. [Google Scholar] [CrossRef] [Green Version]
  17. Robinson, J.T.; Thorvaldsdóttir, H.; Wenger, A.M.; Zehir, A.; Mesirov, J.P. Variant Review with the Integrative Genomics Viewer. Cancer Res. 2017, 77, e31–e34. [Google Scholar] [CrossRef] [Green Version]
  18. Gettings, K.B.; Ballard, D.; Bodner, M.; Borsuk, L.A.; King, J.L.; Parson, W.; Phillips, C. Report from the STRAND Working Group on the 2019 STR sequence nomenclature meeting. Forensic Sci. Int. Genet. 2019, 43, 102165. [Google Scholar] [CrossRef] [Green Version]
  19. Peakall, R.; Smouse, P.E. GenAlEx 6.5: Genetic analysis in Excel. Population genetic software for teaching and research-an update. Bioinformatics 2012, 28, 2537–2539. [Google Scholar] [CrossRef] [Green Version]
  20. Gouy, A.; Zieger, M. STRAF-A convenient online tool for STR data evaluation in forensic genetics. Forensic Sci. Int. Genet. 2017, 30, 148–151. [Google Scholar] [CrossRef]
  21. Excoffier, L.; Lischer, H.E. Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows. Mol. Ecol. Resour. 2010, 10, 564–567. [Google Scholar] [CrossRef] [PubMed]
  22. Hubisz, M.J.; Falush, D.; Stephens, M.; Pritchard, J.K. Inferring weak population structure with the assistance of sample group information. Mol. Ecol. Resour. 2009, 9, 1322–1332. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  23. Rosenberg, N.A. Distruct: A program for the graphical display of population structure. Mol. Ecol. Notes 2004, 4, 137–138. [Google Scholar] [CrossRef]
  24. Jorge, A.; Christopher, P.; Toño, S.; Fernandez, F.L.; Ángel, C.; Maviky, L. pop.STR—An online population frequency browser for established and new forensic STRs. Forensic Sci. Int. Genet. Suppl. Ser. 2009, 2, 361–362. [Google Scholar]
  25. Tang, H.; Kirkness, E.F.; Lippert, C.; Biggs, W.H.; Fabani, M.; Guzman, E.; Ramakrishnan, S.; Lavrenko, V.; Kakaradov, B.; Hou, C.; et al. Profiling of Short-Tandem-Repeat Disease Alleles in 12,632 Human Whole Genomes. Am. J. Hum. Genet. 2017, 101, 700–715. [Google Scholar] [CrossRef]
  26. Willems, T.; Gymrek, M.; Highnam, G.; Mittelman, D.; Erlich, Y.; Consortium, G.P. The landscape of human STR variation. Genome Res. 2014, 24, 1894–1904. [Google Scholar] [CrossRef] [Green Version]
  27. Byrska-Bishop, M.; Evani, U.S.; Zhao, X.; Basile, A.O.; Abel, H.J.; Regier, A.A.; Corvelo, A.; Clarke, W.E.; Musunuri, R.; Nagulapalli, K.; et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell 2022, 185, 3426–3440.e3419. [Google Scholar] [CrossRef] [PubMed]
  28. West, F.L.; Algee-Hewitt, B.F.B. Cadaveric blood cards: Assessing DNA quality and quantity and the utility of STRs for the individual estimation of trihybrid ancestry and admixture proportions. Forensic Sci. Int. Synerg. 2020, 2, 114–122. [Google Scholar] [CrossRef]
  29. Pereira, L.; Alshamali, F.; Andreassen, R.; Ballard, R.; Chantratita, W.; Cho, N.S.; Coudray, C.; Dugoujon, J.M.; Espinoza, M.; González-Andrade, F.; et al. PopAffiliator: Online calculator for individual affiliation to a major population group based on 17 autosomal short tandem repeat genotype profile. Int. J. Leg. Med. 2011, 125, 629–636. [Google Scholar] [CrossRef] [Green Version]
  30. Carratto, T.M.T.; Moraes, V.M.S.; Recalde, T.S.F.; Oliveira, M.L.G.; Teixeira Mendes-Junior, C. Applications of massively parallel sequencing in forensic genetics. Genet. Mol. Biol. 2022, 45, e20220077. [Google Scholar] [CrossRef]
  31. Yuan, L.; Chen, X.; Liu, Z.; Liu, Q.; Song, A.; Bao, G.; Wei, G.; Zhang, S.; Lu, J.; Wu, Y. Identification of the perpetrator among identical twins using next-generation sequencing technology: A case report. Forensic Sci. Int. Genet. 2020, 44, 102167. [Google Scholar] [CrossRef] [PubMed]
  32. Diepenbroek, M.; Bayer, B.; Schwender, K.; Schiller, R.; Lim, J.; Lagacé, R.; Anslinger, K. Evaluation of the Ion AmpliSeq™ PhenoTrivium Panel: MPS-Based Assay for Ancestry and Phenotype Predictions Challenged by Casework Samples. Genes 2020, 11, 1398. [Google Scholar] [CrossRef] [PubMed]
  33. Knijf, P.D. How Next Generation Sequencing Resolved a Difficult Case, Leading to the First Criminal Conviction of Its Kind; Verogen: San Diego, CA, USA, 2020; pp. 1–4. [Google Scholar]
  34. Pilli, E.; Tarallo, R.; Riccia, P.; Berti, A.; Novelletto, A. Kinship assignment with the ForenSeq™ DNA Signature Prep Kit: Sources of error in simulated and real cases. Sci. Justice 2022, 62, 1–9. [Google Scholar] [CrossRef] [PubMed]
  35. Cuenca, D.; Battaglia, J.; Halsing, M.; Sheehan, S. Mitochondrial Sequencing of Missing Persons DNA Casework by Implementing Thermo Fisher’s Precision ID mtDNA Whole Genome Assay. Genes 2020, 11, 1303. [Google Scholar] [CrossRef]
  36. Aalbers, S.E.; Hipp, M.J.; Kennedy, S.R.; Weir, B.S. Analyzing population structure for forensic STR markers in next generation sequencing data. Forensic Sci. Int. Genet. 2020, 49, 102364. [Google Scholar] [CrossRef] [PubMed]
  37. van der Gaag, K.J.; de Leeuw, R.H.; Hoogenboom, J.; Patel, J.; Storts, D.R.; Laros, J.F.J.; de Knijff, P. Massively parallel sequencing of short tandem repeats-Population data and mixture analysis results for the PowerSeq™ system. Forensic Sci. Int. Genet. 2016, 24, 86–96. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  38. Verogen. Universal Analysis Software. Available online: https://verogen.com/products/universal-analysis-software/ (accessed on 20 October 2022).
  39. Scientific, T.F. Precision ID GlobalFiler™ NGS STR Panel v2. Available online: http://www.thermofisher.com/hid-ngs (accessed on 20 October 2022).
  40. Wang, W.; Wei, Z.; Lam, T.W.; Wang, J. Next generation sequencing has lower sequence coverage and poorer SNP-detection capability in the regulatory regions. Sci. Rep. 2011, 1, 55. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  41. Sims, D.; Sudbery, I.; Ilott, N.E.; Heger, A.; Ponting, C.P. Sequencing depth and coverage: Key considerations in genomic analyses. Nat. Rev. Genet. 2014, 15, 121–132. [Google Scholar] [CrossRef]
  42. Castelli, E.C.; Gerasimou, P.; Paz, M.A.; Ramalho, J.; Porto, I.O.P.; Lima, T.H.A.; Souza, A.S.; Veiga-Castelli, L.C.; Collares, C.V.A.; Donadi, E.A.; et al. HLA-G variability and haplotypes detected by massively parallel sequencing procedures in the geographicaly distinct population samples of Brazil and Cyprus. Mol. Immunol. 2017, 83, 115–126. [Google Scholar] [CrossRef] [Green Version]
  43. Belsare, S.; Levy-Sakin, M.; Mostovoy, Y.; Durinck, S.; Chaudhuri, S.; Xiao, M.; Peterson, A.S.; Kwok, P.Y.; Seshagiri, S.; Wall, J.D. Evaluating the quality of the 1000 genomes project data. BMC Genom. 2019, 20, 620. [Google Scholar] [CrossRef] [Green Version]
  44. Rosenberg, N.A. A population-genetic perspective on the similarities and differences among worldwide human populations. Hum. Biol. 2011, 83, 659–684. [Google Scholar] [CrossRef]
  45. Rosenberg, N.A.; Pritchard, J.K.; Weber, J.L.; Cann, H.M.; Kidd, K.K.; Zhivotovsky, L.A.; Feldman, M.W. Genetic structure of human populations. Science 2002, 298, 2381–2385. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  46. Jobling, M.A. Forensic genetics through the lens of Lewontin: Population structure, ancestry and race. Philos. Trans. R. Soc. Lond. B Biol. Sci. 2022, 377, 20200422. [Google Scholar] [CrossRef] [PubMed]
  47. de la Puente, M.; Ruiz-Ramírez, J.; Ambroa-Conde, A.; Xavier, C.; Pardo-Seco, J.; Álvarez-Dios, J.; Freire-Aradas, A.; Mosquera-Miguel, A.; Gross, T.E.; Cheung, E.Y.Y.; et al. Development and Evaluation of the Ancestry Informative Marker Panel of the VISAGE Basic Tool. Genes 2021, 12, 1284. [Google Scholar] [CrossRef] [PubMed]
  48. Phillips, C.; Amigo, J.; Tillmar, A.O.; Peck, M.A.; de la Puente, M.; Ruiz-Ramírez, J.; Bittner, F.; Idrizbegović, Š.; Wang, Y.; Parsons, T.J.; et al. A compilation of tri-allelic SNPs from 1000 Genomes and use of the most polymorphic loci for a large-scale human identification panel. Forensic Sci. Int. Genet. 2020, 46, 102232. [Google Scholar] [CrossRef] [Green Version]
  49. Lan, Q.; Fang, Y.; Mei, S.; Xie, T.; Liu, Y.; Jin, X.; Yang, G.; Zhu, B. Next generation sequencing of a set of ancestry-informative SNPs: Ancestry assignment of three continental populations and estimating ancestry composition for Mongolians. Mol. Genet. Genom. 2020, 295, 1027–1038. [Google Scholar] [CrossRef]
  50. Huang, S.; Sheng, M.; Li, Z.; Li, K.; Chen, J.; Wu, J.; Wang, K.; Shi, C.; Ding, H.; Zhou, H.; et al. Inferring bio-geographical ancestry with 35 microhaplotypes. Forensic Sci. Int. 2022, 341, 111509. [Google Scholar] [CrossRef]
Figure 1. Principal Coordinates Analysis (PCoA) based on autosomal STR data regarding the 26 populations analyzed in the 1000 Genomes Project. Each point represents a population sample. More details on these populations are available in Supplementary Table S2. Coordinates 1 and 2 account for 39.15% and 19.25% of the variance, respectively.
Figure 2. STRUCTURE analysis based on autosomal STR data obtained from the 26 populations included in the 1000 Genomes Project. Five sets of 100 independent runs, with the number of clusters ranging from 3 to 6, were conducted. Each bar plot depicts the results obtained from the run with the largest LnP (D) for the given k.
Table 1. Average coverages obtained for each STR marker using the HipSTR tool.
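| Marker | Lowest Value | Median | Highest Value | Mean | Standard Deviation |
|---|---|---|---|---|---|
| CSF1PO | 21 | 44 | 91 | 44.54 | 8.29 |
| D1S1656 | 24 | 45 | 92 | 45.48 | 8.58 |
| D2S441 | 26 | 49 | 131 | 49.17 | 9.04 |
| D2S1338 | 28 | 50 | 105 | 51.28 | 9.64 |
| D3S1358 | 28 | 51 | 119 | 51.73 | 9.33 |
| D5S818 | 20 | 42 | 98 | 42.90 | 8.14 |
| D7S820 | 20 | 38 | 86 | 39.00 | 7.78 |
| D8S1179 | 26 | 47 | 96 | 48.09 | 8.95 |
| D10S1248 | 18 | 40 | 100 | 40.75 | 7.99 |
| D12S391 | 26 | 52 | 113 | 52.53 | 9.47 |
| D13S317 | 11 | 37 | 79 | 37.93 | 7.50 |
| D16S539 | 21 | 44 | 92 | 44.70 | 8.49 |
| D18S51 | 24 | 47 | 91 | 47.38 | 9.07 |
| D19S433 | 19 | 45 | 89 | 45.28 | 8.62 |
| D22S1045 | 22 | 49 | 111 | 49.47 | 9.11 |
| FGA | 23 | 50 | 118 | 51.33 | 9.49 |
| Penta D | 19 | 43 | 95 | 43.71 | 8.74 |
| Penta E | 18 | 41 | 107 | 41.49 | 8.04 |
| TH01 | 16 | 40 | 83 | 40.73 | 8.01 |
| TPOX | 15 | 37 | 86 | 37.14 | 7.66 |
| vWA | 21 | 47 | 105 | 48.44 | 9.36 |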
Table 2. Allelic frequencies and the forensic parameters estimated for each STR marker in the whole dataset of the 1000 Genomes Project.
Allele | CSF1PO | D1S1656 | D2S441 | D2S1338 | D3S1358 | D5S818 | D7S820 | D8S1179 | D10S1248 | D12S391 | D13S317 | D16S539 | D18S51 | D19S433 | D22S1045 | FGA | Penta D | Penta E | TH01 | TPOX | vWA
5 0.0002 0.01250.08610.0020
6 0.0004 0.0002 0.0004 0.00340.00170.19050.0210
70.01980.0002 0.01180.0178 0.0002 0.00180.0004 0.0002 0.01500.10840.28280.0074
7.30.0002
80.02180.00800.00060.0002 0.02220.18420.00640.0006 0.13940.0314 0.00020.0006 0.05390.08540.12870.4217
8.3 0.0002
90.03340.00040.0018 0.00040.04210.09420.00660.0004 0.08910.18050.00040.00120.0002 0.21950.03580.23140.1534
9.2 0.0052 0.0016 0.0002
9.3 0.1491
100.23980.00800.2264 0.09760.25340.09530.0010 0.07630.11230.00460.00280.0154 0.16600.08730.01500.06210.0004
10.2 0.0004 0.0004 0.0006
10.30.0002 0.0002
110.26580.07880.3401 0.00040.31870.25060.07530.0166 0.26500.28880.01100.02100.1649 0.18390.17740.00020.29710.0014
11.2 0.0004 0.0002 0.0008
11.3 0.0529
11.4 0.0002
120.33750.07320.1062 0.00220.30970.16500.11800.0684 0.29120.23360.08050.07450.0170 0.15060.1886 0.03590.0008
12.2 0.0002 0.00060.0122
12.3 0.0022 0.0004
130.06930.11420.02820.00020.00340.18410.02750.23400.2616 0.09820.13170.11550.27060.0034 0.13800.1065 0.00060.0072
13.2 0.00320.0341
13.3 0.0008 0.0002 0.0002
140.01080.14720.21210.00060.07770.01160.00420.24180.28160.00100.03700.02000.16740.27260.0509 0.03960.0669 0.00060.1255
14.2 0.00100.0623 0.0002 0.0004
14.3 0.00280.0004 0.0002
150.00160.17720.02020.00140.30750.0020 0.15660.22200.03360.00160.00140.16700.10420.33130.00040.01340.0380 0.1217
15.1 0.0002
15.2 0.0006 0.00040.0799 0.0002
15.3 0.0272 0.0002
16 0.14100.00260.03180.30310.0002 0.05590.11780.0342 0.14140.02950.25020.00060.00310.0176 0.2251
16.2 0.0002 0.0004 0.0240 0.0004
16.3 0.0560 0.0002
17 0.0448 0.13280.2063 0.00800.02680.1109 0.11870.00460.14720.00160.00070.0005 0.2433
17.2 0.0004 0.0014 0.0034
17.3 0.0796 0.0078
18 0.0060 0.08970.0903 0.00220.00220.2264 0.0787 0.01700.0110 0.1753
18.2 0.0008 0.00020.0008 0.0030
18.3 0.0298 0.0096 0.0002 0.0002
19 0.0006 0.16790.0072 0.00020.1743 0.0519 0.00140.0673 0.0770
19.2 0.0030 0.0016
19.3 0.0040 0.0044
20 0.11060.0008 0.1415 0.0308 0.00040.0906 0.0206
20.2 0.0004 0.0002 0.0012
20.3 0.0006 0.0002
21 0.0637 0.0911 0.0130 0.00020.1247 0.0006
21.2 0.0034
22 0.0813 0.0741 0.0076 0.1785 0.0004
22.2 0.0046
23 0.1306 0.0552 0.0028 0.1679
23.2 0.0034
23.3 0.0002
24 0.1012 0.0180 0.0014 0.1619
24.2 0.0048
25 0.0685 0.0104 0.0004 0.1026
25.2 0.0028
25.3 0.0002
26 0.0154 0.0014 0.0436
26.2 0.0012
26.3 0.0002
27 0.0030 0.0137
27.2 0.0002
28 0.0010 0.0054
29 0.0002 0.0026
30 0.0002
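| Parameter | CSF1PO | D1S1656 | D2S441 | D2S1338 | D3S1358 | D5S818 | D7S820 | D8S1179 | D10S1248 | D12S391 | D13S317 | D16S539 | D18S51 | D19S433 | D22S1045 | FGA | Penta D | Penta E | TH01 | TPOX | vWA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N | 2504 | 2500 | 2504 | 2504 | 2504 | 2504 | 2494 | 2504 | 2500 | 2502 | 2485 | 2502 | 2497 | 2496 | 2504 | 2490 | 2235 | 2108 | 2502 | 2504 | 2499 |
| Na | 11 | 21 | 15 | 18 | 13 | 10 | 13 | 11 | 16 | 22 | 11 | 9 | 27 | 22 | 14 | 31 | 15 | 13 | 10 | 10 | 16 |
| Ho | 0.7492 | 0.8440 | 0.7380 | 0.8722 | 0.7496 | 0.7220 | 0.7927 | 0.8103 | 0.7536 | 0.8437 | 0.7666 | 0.7838 | 0.8614 | 0.8121 | 0.7364 | 0.8305 | 0.7808 | 0.4877 | 0.7450 | 0.6621 | 0.7943 |
| He | 0.7512 | 0.8893 | 0.7729 | 0.8902 | 0.7569 | 0.7567 | 0.8020 | 0.8305 | 0.7836 | 0.8665 | 0.8010 | 0.7983 | 0.8801 | 0.8227 | 0.7755 | 0.8728 | 0.8439 | 0.8803 | 0.7914 | 0.7048 | 0.8226 |
| MP | 0.1059 | 0.0224 | 0.0833 | 0.0226 | 0.1002 | 0.0978 | 0.0690 | 0.0499 | 0.0785 | 0.0324 | 0.0676 | 0.0697 | 0.0267 | 0.0509 | 0.0841 | 0.0287 | 0.0423 | 0.0420 | 0.0737 | 0.1341 | 0.0548 |
| PE | 0.5084 | 0.6831 | 0.4895 | 0.7391 | 0.5091 | 0.4632 | 0.5856 | 0.6183 | 0.5159 | 0.6825 | 0.5386 | 0.5693 | 0.7175 | 0.6217 | 0.4868 | 0.6569 | 0.5638 | 0.1769 | 0.5013 | 0.3722 | 0.5885 |
| PD | 0.8941 | 0.9776 | 0.9167 | 0.9774 | 0.8998 | 0.9022 | 0.9310 | 0.9501 | 0.9215 | 0.9676 | 0.9324 | 0.9303 | 0.9733 | 0.9491 | 0.9159 | 0.9713 | 0.9577 | 0.9580 | 0.9263 | 0.8659 | 0.9452 |
| PIC | 0.7104 | 0.8790 | 0.7393 | 0.8797 | 0.7169 | 0.7180 | 0.7726 | 0.8088 | 0.7501 | 0.8526 | 0.7738 | 0.7688 | 0.8680 | 0.8021 | 0.7421 | 0.8593 | 0.8244 | 0.8684 | 0.7589 | 0.6575 | 0.7985 |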
N: number of samples; Na: number of alleles; Ho: observed heterozygosity; He: expected heterozygosity; MP: match probability; PE: power of exclusion; PD: power of discrimination; PIC: polymorphism information content.
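As an illustration of how the parameters defined above relate to one another, the sketch below computes He and PIC from a vector of allele frequencies and derives PD and PE from MP and Ho. It is a minimal sketch under stated assumptions rather than the procedure used to build Table 2: the function names are hypothetical, He is computed without a sample-size correction, PIC follows the standard formula 1 − Σp² − Σ2p²q², and PE uses the common heterozygosity-based approximation PE = h²(1 − 2hH²) with H = 1 − h. The toy frequencies are invented; only the CSF1PO values of Ho and MP are taken from Table 2.

```python
from itertools import combinations
from typing import Sequence


def expected_heterozygosity(freqs: Sequence[float]) -> float:
    """He = 1 - sum(p_i^2), without sample-size correction."""
    return 1.0 - sum(p * p for p in freqs)


def polymorphism_information_content(freqs: Sequence[float]) -> float:
    """PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(2.0 * p * p * q * q for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_term


def power_of_discrimination(match_probability: float) -> float:
    """PD is the complement of the match probability (PD = 1 - MP)."""
    return 1.0 - match_probability


def power_of_exclusion(heterozygosity: float) -> float:
    """PE = h^2 * (1 - 2 * h * H^2) with H = 1 - h (assumed estimator)."""
    h = heterozygosity
    homozygosity = 1.0 - h
    return h * h * (1.0 - 2.0 * h * homozygosity * homozygosity)


if __name__ == "__main__":
    toy_freqs = [0.5, 0.3, 0.2]  # hypothetical allele frequencies
    print("He :", round(expected_heterozygosity(toy_freqs), 4))
    print("PIC:", round(polymorphism_information_content(toy_freqs), 4))
    # CSF1PO inputs from Table 2: Ho = 0.7492, MP = 0.1059
    print("PD :", round(power_of_discrimination(0.1059), 4))
    print("PE :", round(power_of_exclusion(0.7492), 4))
```

With the CSF1PO inputs, these expressions return values that round to the PD (0.8941) and PE (0.5084) listed in Table 2, which suggests, but does not prove, that comparable estimators were used.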
Table 3. Probabilities of adherence to Hardy-Weinberg equilibrium proportions for each STR in all the 26 subpopulations analyzed in the 1000 Genomes Project. Significant p-values (α = 0.05) are in boldface. The probabilities that remained significant after the Bonferroni correction for multiple tests (αBONFERRONI = 0.05/546 = 0.000092) are also underlined.
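| POP | CSF1PO | D1S1656 | D2S441 | D2S1338 | D3S1358 | D5S818 | D7S820 | D8S1179 | D10S1248 | D12S391 | D13S317 | D16S539 | D18S51 | D19S433 | D22S1045 | FGA | Penta D | Penta E | TH01 | TPOX | vWA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ACB | 0.8317 | 0.1003 | 0.9562 | 0.6768 | 0.7289 | 0.5439 | 0.0238 | 0.1626 | 0.7625 | 0.9937 | 0.1509 | 0.9058 | 0.1797 | 0.9795 | 0.8294 | 0.8814 | 0.1985 | 0.2895 | 0.4184 | 0.0678 | 0.1618 |
| ASW | 0.8494 | 0.8119 | 0.9805 | 0.7607 | 0.8298 | 0.8945 | 0.6251 | 0.9977 | 0.5225 | 0.4894 | 0.2447 | 0.9927 | 0.6351 | 0.3402 | 0.4035 | 0.8225 | 0.3519 | 0.0724 | 0.9511 | 0.9248 | 0.9409 |
| BEB | 0.6165 | 0.6321 | 0.1621 | 0.0216 | 0.9740 | 0.6614 | 0.4515 | 0.9476 | 0.8795 | 0.9124 | 0.5829 | 0.6662 | 0.3397 | 0.8489 | 0.6579 | 0.8494 | 0.4690 | 0.0000 | 0.2040 | 0.4850 | 0.9571 |
| CDX | 0.9917 | 0.3823 | 0.4668 | 0.9308 | 0.4053 | 0.0000 | 0.5757 | 0.9797 | 0.5425 | 0.4254 | 0.4678 | 0.2605 | 0.6701 | 0.5218 | 0.2454 | 0.1603 | 0.4560 | 0.0000 | 0.0268 | 0.0025 | 0.4988 |
| CEU | 0.9331 | 0.5301 | 0.9948 | 0.1233 | 0.1851 | 0.0674 | 0.6688 | 0.2600 | 0.8395 | 0.5314 | 0.8354 | 0.9559 | 0.1012 | 0.9471 | 0.8354 | 0.8997 | 0.5208 | 0.0000 | 0.7591 | 0.4191 | 0.1209 |
| CHB | 0.3105 | 0.7945 | 0.0005 | 0.8305 | 0.9407 | 0.8190 | 0.4159 | 0.2194 | 0.7585 | 0.0003 | 0.9664 | 0.4847 | 0.1581 | 0.0047 | 0.9689 | 0.9747 | 0.0011 | 0.0000 | 0.9424 | 0.8976 | 0.0000 |
| CHS | 0.2302 | 0.1077 | 0.2844 | 0.6237 | 0.5740 | 0.0727 | 0.9894 | 0.4883 | 0.3859 | 0.6180 | 0.7626 | 0.6386 | 0.3969 | 0.0391 | 0.8843 | 0.9618 | 0.4278 | 0.0000 | 0.1768 | 0.7723 | 0.8666 |
| CLM | 0.9075 | 0.3108 | 0.4415 | 0.0684 | 0.3560 | 0.0470 | 0.0000 | 0.8558 | 0.5450 | 0.0004 | 0.1412 | 0.5892 | 0.0829 | 0.9682 | 0.9999 | 0.9976 | 0.6915 | 0.0000 | 0.4287 | 0.4076 | 0.7427 |
| ESN | 0.1175 | 0.9988 | 0.9706 | 0.9750 | 0.0303 | 0.2028 | 0.6773 | 0.0131 | 0.8721 | 0.1069 | 0.8443 | 0.9823 | 0.9529 | 0.4869 | 0.0209 | 1.0000 | 0.0110 | 0.0000 | 0.9238 | 0.3436 | 0.0579 |
| FIN | 0.9558 | 0.4612 | 0.7627 | 0.0000 | 0.5922 | 0.5269 | 0.8596 | 0.3818 | 0.8976 | 0.9531 | 0.9688 | 0.7919 | 0.2147 | 0.9665 | 0.9869 | 0.7378 | 0.9930 | 0.0000 | 0.9504 | 0.8369 | 0.9465 |
| GBR | 0.9311 | 0.9506 | 0.2788 | 0.8505 | 0.9925 | 0.8037 | 0.8379 | 0.9828 | 0.8791 | 0.8061 | 0.2196 | 0.0259 | 0.2483 | 0.0000 | 0.0317 | 0.9879 | 0.2263 | 0.0000 | 0.0512 | 0.4530 | 0.4718 |
| GIH | 0.6965 | 0.0239 | 0.8370 | 0.6288 | 0.9993 | 0.4325 | 0.6899 | 0.0011 | 0.8836 | 0.9808 | 0.3863 | 0.6818 | 0.9979 | 0.7126 | 0.4995 | 0.7371 | 0.6790 | 0.0000 | 0.0770 | 0.7827 | 0.3344 |
| GWD | 0.1950 | 0.6832 | 0.9970 | 0.9978 | 0.5107 | 0.8942 | 0.2703 | 0.3213 | 0.9718 | 0.9527 | 0.4987 | 0.7810 | 0.0098 | 0.9973 | 0.2150 | 0.1932 | 0.0000 | 0.0000 | 0.1804 | 0.8530 | 0.9779 |
| IBS | 0.0000 | 0.8111 | 0.9373 | 0.9690 | 0.1246 | 0.9874 | 0.8882 | 0.8501 | 0.2448 | 0.5344 | 0.9012 | 0.1202 | 0.9943 | 0.9706 | 0.9373 | 0.5506 | 0.0333 | 0.0000 | 0.4393 | 0.6766 | 0.3111 |
| ITU | 0.6156 | 0.3367 | 0.0097 | 0.7459 | 0.7367 | 0.9723 | 0.8685 | 0.8721 | 0.1101 | 0.9617 | 0.7853 | 0.5636 | 0.9777 | 0.9016 | 0.9803 | 0.0404 | 0.8992 | 0.0000 | 0.9027 | 0.3481 | 0.4022 |
| JPT | 0.8554 | 0.7945 | 0.5003 | 0.7191 | 0.6566 | 0.6312 | 0.7590 | 0.2612 | 0.8154 | 0.7891 | 0.3862 | 0.9589 | 0.9922 | 0.9952 | 0.0006 | 0.7052 | 0.9159 | 0.0000 | 0.7771 | 0.0002 | 0.8666 |
| KHV | 0.9942 | 0.0025 | 0.8299 | 0.9795 | 0.9899 | 0.2980 | 0.2296 | 0.3737 | 0.7030 | 0.9483 | 0.5815 | 0.9489 | 0.1698 | 0.5107 | 0.9073 | 0.1862 | 0.0000 | 0.0000 | 0.5244 | 0.0006 | 0.4226 |
| LWK | 0.3621 | 0.9913 | 0.9838 | 0.3363 | 0.5245 | 0.8976 | 0.6751 | 0.6144 | 0.8436 | 0.1478 | 0.6934 | 0.7081 | 0.2947 | 0.4416 | 0.9396 | 0.9998 | 0.6011 | 0.0000 | 0.9548 | 0.0325 | 0.9572 |
| MSL | 0.8099 | 0.2244 | 0.4496 | 0.7258 | 0.9316 | 0.0628 | 0.0372 | 0.7088 | 0.0992 | 0.9442 | 0.9897 | 0.8470 | 1.0000 | 0.9398 | 0.2779 | 0.1057 | 0.2086 | 0.0000 | 0.0171 | 0.5908 | 0.0000 |
| MXL | 0.0047 | 0.0109 | 0.8297 | 0.6202 | 0.1254 | 0.8631 | 0.5456 | 0.0125 | 0.0509 | 0.4395 | 0.8989 | 0.7887 | 0.0000 | 0.3860 | 0.9233 | 0.1562 | 0.5281 | 0.0000 | 0.3747 | 0.4029 | 0.5094 |
| PEL | 0.8460 | 0.1247 | 0.9976 | 0.7122 | 0.1681 | 0.9781 | 0.7676 | 0.7192 | 0.9635 | 0.9402 | 0.1281 | 0.6186 | 0.9833 | 0.9176 | 0.8786 | 0.0000 | 0.5730 | 0.0000 | 0.8399 | 0.2170 | 0.0467 |
| PJL | 0.8683 | 0.0073 | 1.0000 | 0.6486 | 0.8966 | 0.0001 | 0.8996 | 0.5188 | 0.8722 | 0.5388 | 0.9943 | 0.4565 | 0.9976 | 0.6490 | 0.9166 | 0.0053 | 0.1485 | 0.0000 | 0.0721 | 0.6604 | 0.2985 |
| PUR | 0.7847 | 0.0819 | 0.0058 | 0.9097 | 0.1398 | 0.8342 | 0.7698 | 0.4704 | 0.0034 | 0.0045 | 0.0337 | 0.0556 | 0.7141 | 0.0006 | 0.9810 | 0.7965 | 0.7547 | 0.0000 | 0.0000 | 0.2571 | 0.0191 |
| STU | 0.9028 | 0.5930 | 0.0000 | 0.8290 | 0.2546 | 0.2661 | 0.0071 | 0.4770 | 0.9882 | 0.2049 | 0.0952 | 0.1927 | 0.4262 | 0.2335 | 0.0001 | 0.6382 | 0.2129 | 0.0000 | 0.6082 | 0.5311 | 0.9627 |
| TSI | 0.6299 | 0.4393 | 0.9777 | 0.0805 | 0.5719 | 0.4424 | 0.6240 | 0.4107 | 0.0356 | 0.6573 | 0.5777 | 0.9581 | 0.0074 | 0.0000 | 0.1240 | 0.5698 | 0.3556 | 0.0000 | 0.0203 | 0.3932 | 0.8569 |
| YRI | 0.6908 | 0.2395 | 0.8925 | 0.3004 | 0.0000 | 0.8611 | 0.7701 | 0.9277 | 0.6050 | 0.3079 | 0.5197 | 0.4061 | 0.9867 | 0.8417 | 0.9451 | 0.9643 | 0.4923 | 0.0000 | 0.4285 | 0.1299 | 0.9942 |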
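The Bonferroni threshold quoted in the caption of Table 3 follows from dividing α by the number of tests, here 26 populations × 21 markers = 546. A minimal sketch of that arithmetic is given below. The function names are hypothetical, and the three p-values are taken from Table 3 (ACB at D7S820, CHB at D2S441, CDX at D5S818) purely as examples.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level after Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def classify(p_value: float, alpha: float, n_tests: int) -> str:
    """Label a p-value as in the Table 3 caption (nominal vs. corrected)."""
    if p_value < bonferroni_threshold(alpha, n_tests):
        return "significant after Bonferroni"
    if p_value < alpha:
        return "nominally significant only"
    return "not significant"


if __name__ == "__main__":
    n_tests = 26 * 21  # populations x markers in Table 3
    print(f"threshold = {bonferroni_threshold(0.05, n_tests):.6f}")
    examples = {
        "ACB / D7S820": 0.0238,
        "CHB / D2S441": 0.0005,
        "CDX / D5S818": 0.0000,  # printed as 0.0000 in Table 3
    }
    for label, p_value in examples.items():
        print(label, "->", classify(p_value, 0.05, n_tests))
```

The same function with 21 markers × 5 population groups = 105 tests reproduces the threshold given in the caption of Table 4.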
Table 4. FST and probabilities of population non-differentiation comparing population groups of the 1000 Genomes Project to those of the SPSmart STR browser (PopSTR) for each STR marker. Significant p-values (α = 0.05) are in boldface. The probabilities that remained significant after the Bonferroni correction for multiple tests (αBONFERRONI = 0.05/105 = 0.00048) are also underlined.
| Marker | AFR FST | AFR p-value | AMR FST | AMR p-value | EAS FST | EAS p-value | EUR FST | EUR p-value | SAS FST | SAS p-value |
|---|---|---|---|---|---|---|---|---|---|---|
| CSF1PO | 0.0084 | 0.0057 ± 0.0007 | −0.00104 | 0.7571 ± 0.0047 | −0.0014 | 0.6656 ± 0.0053 | 0.0004 | 0.2445 ± 0.0044 | 0.0019 | 0.1951 ± 0.0036 |
| D1S1656 | 0.0024 | 0.0355 ± 0.0019 | 0.00028 | 0.3149 ± 0.0046 | −0.0024 | 0.9774 ± 0.0015 | 0.0045 | 0.0000 ± 0.0000 | 0.0021 | 0.1152 ± 0.0032 |
| D2S441 | 0.0093 | 0.0005 ± 0.0002 | 0.00370 | 0.0216 ± 0.0015 | 0.0025 | 0.1251 ± 0.0027 | 0.0074 | 0.0000 ± 0.0000 | 0.0122 | 0.0058 ± 0.0007 |
| D2S1338 | 0.0124 | 0.0000 ± 0.0000 | −0.00048 | 0.6527 ± 0.0049 | −0.0026 | 0.9940 ± 0.0007 | 0.0031 | 0.0008 ± 0.0003 | 0.0024 | 0.1044 ± 0.0030 |
| D3S1358 | 0.0044 | 0.021 ± 0.0014 | −0.00122 | 0.8369 ± 0.0040 | 0.0010 | 0.2578 ± 0.0044 | 0.0013 | 0.0835 ± 0.0026 | 0.0033 | 0.1143 ± 0.0031 |
| D5S818 | 0.0030 | 0.0477 ± 0.0024 | 0.00609 | 0.0051 ± 0.0008 | −0.0014 | 0.6936 ± 0.0048 | −0.0007 | 0.7638 ± 0.0047 | −0.0014 | 0.6250 ± 0.0050 |
| D7S820 | 0.0047 | 0.0095 ± 0.0009 | −0.00105 | 0.7982 ± 0.0041 | −0.0006 | 0.5151 ± 0.0050 | −0.0003 | 0.5677 ± 0.0054 | −0.0023 | 0.8731 ± 0.0034 |
| D8S1179 | 0.0033 | 0.0308 ± 0.0018 | −0.00158 | 0.9817 ± 0.0014 | −0.0016 | 0.7986 ± 0.0042 | 0.0024 | 0.0150 ± 0.0012 | −0.0003 | 0.4615 ± 0.0049 |
| D10S1248 | 0.0017 | 0.1081 ± 0.0033 | −0.00132 | 0.8984 ± 0.0029 | −0.0031 | 0.9976 ± 0.0005 | 0.0003 | 0.2627 ± 0.0044 | −0.0001 | 0.4055 ± 0.0046 |
| D12S391 | 0.0007 | 0.1920 ± 0.0041 | 0.00077 | 0.1805 ± 0.0036 | −0.0019 | 0.8774 ± 0.0030 | 0.0008 | 0.0946 ± 0.0030 | −0.0009 | 0.6155 ± 0.0047 |
| D13S317 | 0.0160 | 0.0000 ± 0.0000 | 0.00017 | 0.3478 ± 0.0044 | −0.0018 | 0.7993 ± 0.0045 | 0.0003 | 0.2904 ± 0.0047 | 0.0051 | 0.0354 ± 0.0019 |
| D16S539 | 0.0105 | 0.0002 ± 0.0001 | −0.00051 | 0.5773 ± 0.0047 | −0.0010 | 0.5939 ± 0.0046 | 0.0047 | 0.0009 ± 0.0003 | −0.0010 | 0.5987 ± 0.0052 |
| D18S51 | 0.0018 | 0.0483 ± 0.0019 | −0.00138 | 0.9831 ± 0.0013 | −0.0007 | 0.5795 ± 0.0055 | 0.0005 | 0.2028 ± 0.0042 | −0.0003 | 0.4671 ± 0.0047 |
| D19S433 | 0.0093 | 0.0003 ± 0.0002 | 0.00151 | 0.1085 ± 0.0027 | 0.0010 | 0.2476 ± 0.0041 | 0.0000 | 0.3759 ± 0.0049 | 0.0036 | 0.0660 ± 0.0028 |
| D22S1045 | 0.0004 | 0.2943 ± 0.0053 | 0.00217 | 0.0766 ± 0.0025 | 0.0094 | 0.0055 ± 0.0007 | 0.0034 | 0.0087 ± 0.0009 | 0.0054 | 0.0516 ± 0.0021 |
| FGA | 0.0011 | 0.1503 ± 0.0037 | −0.00089 | 0.8382 ± 0.0034 | 0.0036 | 0.0397 ± 0.0020 | 0.0007 | 0.1306 ± 0.0035 | 0.0018 | 0.1518 ± 0.0039 |
| Penta D | 0.0275 | 0.0000 ± 0.0000 | 0.00318 | 0.0113 ± 0.0010 | −0.0015 | 0.7660 ± 0.0042 | 0.0003 | 0.2969 ± 0.0042 | 0.0000 | 0.4117 ± 0.0048 |
| Penta E | 0.0139 | 0.0000 ± 0.0000 | 0.00733 | 0.0000 ± 0.0000 | 0.0412 | 0.0000 ± 0.0000 | 0.0083 | 0.0000 ± 0.0000 | 0.0179 | 0.0000 ± 0.0000 |
| TH01 | 0.0103 | 0.0004 ± 0.0002 | 0.00090 | 0.2083 ± 0.0036 | −0.0025 | 0.9262 ± 0.0027 | 0.0057 | 0.0001 ± 0.0001 | 0.0026 | 0.1425 ± 0.0038 |
| TPOX | 0.0039 | 0.0269 ± 0.0019 | −0.00057 | 0.5622 ± 0.005 | −0.0011 | 0.5341 ± 0.0046 | 0.0015 | 0.0939 ± 0.0033 | 0.0063 | 0.0455 ± 0.0020 |
| vWA | 0.0065 | 0.0007 ± 0.0003 | −0.00095 | 0.7575 ± 0.0047 | 0.0005 | 0.3135 ± 0.0047 | −0.0001 | 0.4312 ± 0.0051 | −0.0010 | 0.5977 ± 0.0049 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Frontanilla, T.S.; Valle-Silva, G.; Ayala, J.; Mendes-Junior, C.T. Open-Access Worldwide Population STR Database Constructed Using High-Coverage Massively Parallel Sequencing Data Obtained from the 1000 Genomes Project. Genes 2022, 13, 2205. https://doi.org/10.3390/genes13122205
