Research paper
A multivariate statistical approach for the estimation of the ethnic origin of unknown genetic profiles in forensic genetics
Introduction
DNA profiling of biological evidence, such as that recovered from crime scenes, mass-disaster areas or missing person investigations, is one of the most challenging topics in forensic sciences [[1], [2], [3]]. Over the years, DNA typing has been increasingly employed, exploiting large sets of genetic markers that can be analyzed simultaneously on a single biological sample or trace, even one containing only a few copies of DNA.
In the last decade, because of continuous technical developments in forensic genetics, DNA analysis has moved towards so-called Next Generation Sequencing (NGS), or Massively Parallel Sequencing (MPS). Currently, this technology enables genotyping at a large number of Short Tandem Repeat (STR) loci in addition to an ever-growing number of further markers such as autosomal and Y-chromosome Single Nucleotide Polymorphisms (SNPs) and mitochondrial DNA (mtDNA) variants [4]. STR markers are now widely used for personal identification in the interpretation of single-source samples and DNA mixtures collected, for example, during crime scene investigations [[5], [6], [7]]. On the other hand, the more evolutionarily stable SNPs, in the biparental and uniparental portions of the genome, are being used to infer the biogeographical ancestry and ethnic origin (generally referred to as BGA) of individuals and degraded samples [[8], [9], [10], [11]]. To date, while autosomal STR markers are the tool of choice for personal identification, they have rarely been employed as Ancestry Informative Markers (AIMs), because STR alleles that are identical in state occur in diverse populations, mostly as a result of recurrent mutation (homoplasy).
Bayesian statistics have been applied to estimate the ethnic affiliation of unknown genetic profiles [12,13] obtained with autosomal STRs in well-known software such as STRUCTURE [14], the Snipper App suite [15] and PopAffiliator 2 [16]. These approaches perform Bayesian evaluations by relating the allele frequencies of specific populations to the alleles observed in individuals recognized as members of those populations. This is done by computing the likelihood of membership in each of the tested population groups, according to their relative allele frequencies. An advantage of these methodologies is that prior information about the samples can be taken into account during the analysis [17]. In the case of multi-locus genotypes, the ability to obtain large amounts of data from a single biological sample calls for appropriate statistical strategies to extract information about its ancestry that is as precise as possible. In this context, multivariate data analysis techniques may offer advantages for inferring the ethnic affiliation or ancestry of unknown subjects’ genetic profiles. These methods may discriminate among several groups simultaneously, with high specificity and sensitivity. Software based on Likelihood Ratios (LR) traditionally compares only two alternative hypotheses, while multivariate techniques may efficiently evaluate several population groups together. However, likelihood-based methods for BGA estimation overcome this limitation by computing the likelihood of membership in each of the populations under examination [17,18]. In the present study, we apply multivariate methodologies such as Sparse and Logistic Principal Component Analysis (SL-PCA) [19], Sparse Partial Least Squares-Discriminant Analysis (sPLS-DA) [[20], [21], [22]] and Support Vector Machines (SVM) [[23], [24], [25]] to autosomal STR data sets.
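The likelihood-based assignment described above can be illustrated with a minimal Python sketch. It is not the algorithm of STRUCTURE, Snipper or PopAffiliator 2; it only assumes Hardy-Weinberg proportions within each population and independence between loci, and it combines the resulting genotype likelihoods with prior probabilities through Bayes' rule. The population names, the allele frequencies, the `floor` value for unseen alleles and the function names are all hypothetical.

```python
from math import prod

# Hypothetical allele frequencies per population and locus (toy values, not real data).
FREQS = {
    "PopA": {"D3S1358": {"15": 0.30, "16": 0.25, "17": 0.45},
             "D8S1179": {"12": 0.10, "13": 0.35, "14": 0.55}},
    "PopB": {"D3S1358": {"15": 0.10, "16": 0.60, "17": 0.30},
             "D8S1179": {"12": 0.40, "13": 0.40, "14": 0.20}},
}

def genotype_likelihood(profile, pop_freqs, floor=0.001):
    """P(profile | population), assuming Hardy-Weinberg proportions and independent loci."""
    per_locus = []
    for locus, (a, b) in profile.items():
        # Alleles never observed in the population get a small floor frequency (assumption).
        p = pop_freqs[locus].get(a, floor)
        q = pop_freqs[locus].get(b, floor)
        per_locus.append(p * p if a == b else 2 * p * q)
    return prod(per_locus)

def posterior_membership(profile, freqs, priors=None):
    """Posterior probability of membership in each population (flat priors by default)."""
    pops = list(freqs)
    priors = priors or {pop: 1 / len(pops) for pop in pops}
    joint = {pop: priors[pop] * genotype_likelihood(profile, freqs[pop]) for pop in pops}
    total = sum(joint.values())
    return {pop: value / total for pop, value in joint.items()}

# An unknown two-locus profile: homozygous 16/16 and heterozygous 12/13.
profile = {"D3S1358": ("16", "16"), "D8S1179": ("12", "13")}
posterior = posterior_membership(profile, FREQS)
print(posterior)
```

Because the posterior is normalized over all candidate populations, the same function handles two or many groups, which is how likelihood-based methods avoid being restricted to a single pair of hypotheses.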
SL-PCA, sPLS-DA and SVM were selected because they proved capable of handling the nature of genotypic data, which can be easily binarized. Our goal was to develop multivariate approaches for the interpretation of DNA profiles that better estimate the biogeographical ancestry of individual genetic profiles, by building dynamic and flexible models that can be easily adapted to the number of tested populations and to the number of markers in the profile and in the reference panel. Our multivariate statistical approach may represent a powerful tool both for research purposes and for investigative authorities.
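As an illustration of what binarizing genotypic data can look like, the following Python sketch encodes each profile as a 0/1 vector with one column per (locus, allele) pair. This is one plausible presence/absence encoding, not necessarily the one used in the study; the toy profiles and the function names `build_columns` and `binarize` are hypothetical.

```python
def build_columns(profiles):
    """Collect every (locus, allele) pair seen in the profiles, in sorted order."""
    return sorted({(locus, allele)
                   for profile in profiles
                   for locus, alleles in profile.items()
                   for allele in alleles})

def binarize(profile, columns):
    """Return a 0/1 vector: 1 if the profile carries that allele at that locus."""
    return [1 if allele in profile.get(locus, ()) else 0 for locus, allele in columns]

# Toy genotypes (hypothetical values): each locus maps to its two alleles.
profiles = [
    {"D3S1358": ("15", "16"), "D8S1179": ("13", "13")},
    {"D3S1358": ("16", "17"), "D8S1179": ("12", "14")},
]
columns = build_columns(profiles)
matrix = [binarize(p, columns) for p in profiles]

for row in matrix:
    print(row)
# [1, 1, 0, 0, 1, 0]
# [0, 1, 1, 1, 0, 1]
```

With this encoding a homozygous genotype sets a single column to 1, so allele dosage is not retained; an allele-count encoding (0, 1 or 2) would be an alternative if dosage matters. Adding a population or a marker only adds rows or columns, which fits the flexible models described above.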
Section snippets
Datasets
Four different population datasets were selected for this study. All the datasets consisted of individual genotypes rather than allele frequencies. In order of decreasing heterogeneity, the first dataset was extracted from the NIST U.S. population database [26] and consisted of genotypic data for U.S. African-American (N = 342), Asian (N = 97) and Caucasian (N = 361) individuals. For this dataset, the following 24 markers were selected: D1S1656, D2S441, D2S1338, D3S1358, D5S818, D6S1043, D7S820, D8S1179,
SL-PCA analysis
SL-PCA was first used to rapidly explore the main features of the datasets. For the NIST dataset (Fig. 1a), three main clusters corresponding to the African-American, Caucasian and Asian individuals were observed in the space of the first two PCs (accounting for 88.03 % of total variance). A good separation was also observed for the SL-PCA comparison involving the Northern African and the Sub-Saharan African individuals, where the first two PCs accounted for 65.19 % of the total
Conclusions
The present proof-of-concept study demonstrates the capability of multivariate statistical approaches to predict the population affiliation of autosomal genetic profiles that can be commonly recovered from any source, including crime scenes, mass disasters and missing person investigations. The sPLS-DA and SVM techniques markedly improved on PCA, by providing optimal discrimination results (i.e. a lowest sensitivity value of 84 %) and being capable of assessing the group affiliation
Acknowledgements
This work was supported by: Sapienza University of Rome (grant n. RM11715C77B03CDC to FC); University of Pavia strategic theme “Towards a governance model for international migration: an interdisciplinary and diachronic perspective” (MIGRAT-IN-G) (OS); the Italian Ministry of Education, University and Research (MIUR): Dipartimenti di Eccellenza Program (2018–2022), Dept. of Biology and Biotechnology "L. Spallanzani", University of Pavia (OS).
References (47)
- et al. Genotyping and interpretation of STR-DNA: low-template, mixtures and database matches - Twenty years of research and development. Forensic Sci. Int. Genet. (2015)
- et al. DNA commission of the international society of forensic genetics: recommendations on the interpretation of mixtures. Forensic Sci. Int. (2006)
- et al. An illustration of the effect of various sources of uncertainty on DNA likelihood ratio calculations. Forensic Sci. Int. Genet. (2014)
- et al. Allele frequencies for 70 autosomal SNP loci with U.S. Caucasian, African-American, and Hispanic samples. Forensic Sci. Int. (2005)
- et al. Development of a SNP set for human identification: a set with high powers of discrimination which yields high genetic information from naturally degraded DNA samples in the Thai population. Forensic Sci. Int. Genet. (2014)
- Some mathematical problems in the DNA identification of victims in the 2004 tsunami and similar mass fatalities. Forensic Sci. Int. (2006)
- et al. Issues and strategies in the DNA identification of World Trade Center victims. Theor. Popul. Biol. (2003)
- et al. U.S. Population data for 29 autosomal STR loci. Forensic Sci. Int. Genet. (2013)
- et al. New guidelines for the publication of genetic population data. Forensic Sci. Int. Genet. (2013)
- et al. Update of the guidelines for the publication of genetic population data. Forensic Sci. Int. Genet. (2014)
- Revised guidelines for the publication of genetic population data. Forensic Sci. Int. Genet.
- Allele frequencies of 15 autosomal STR loci in the Iraq population with comparisons to other populations from the middle-eastern region. Forensic Sci. Int.
- Allele frequencies of the new European Standard Set (ESS) loci in the Italian population. Forensic Sci. Int. Genet.
- STRAF - a convenient online tool for STR data evaluation in forensic genetics. Forensic Sci. Int. Genet.
- PLS-regression: a basic tool of chemometrics. Chemometr. Intell. Lab. Syst.
- Handbook of Forensic Genetics
- Improving human forensics through advances in genetics, genomics and molecular biology. Nat. Rev. Genet.
- Development and validation of the EUROFORGEN NAME (North African and Middle Eastern) ancestry panel. Forensic Sci. Int. Genet.
- Mixture Interpretation: Defining the Relevant Features for Guidelines for the Assessment of Mixed DNA Profiles in Forensic Casework. J. Forensic Sci.
- Advanced Topics in Forensic DNA Typing: Methodology
- Inference of ancestry in forensic analysis I: autosomal ancestry-informative marker sets. Methods Mol. Biol.
- An overview of STRUCTURE: applications, parameter settings, and supporting software. Front. Genet.
- Inference of ancestry in forensic analysis II: analysis of genetic data. Methods Mol. Biol.