Keywords
bioinformatics, kinship, python, genealogy, GWAS, simulation, benchmarking
Single nucleotide polymorphisms (SNPs) are inherited single base-pair substitutions in genomic DNA that are easily characterized by microarray, targeted, or whole-genome sequencing, and have a wide range of uses in personalized medicine, disease association studies, population genetics, genealogy, and forensics.1,2 Genotyping error occurs at known rates on genomic sequencing and array platforms.3 However, novel applications of sequencing or genotyping such as those encountered in forensics4 or ancient DNA,5 or novel downstream data analysis techniques such as imputation on low-pass whole-genome sequencing6–8 can result in genotyping errors that differ from well-established rates and types of errors that occur in typical genotyping and high-coverage sequencing in high-quality samples. Different applications or data analysis procedures can result in errors that can manifest in a variety of ways, for example, with differing rates of drop in or drop out of heterozygous calls.
There are several existing tools for simulating sequencing or genotype array data. Wgsim is a tool for simulating sequencing reads from a reference genome, allowing for a uniform user-defined substitution sequencing error.9 The ART sequencing read simulator allows for technology-specific read error models and base quality value profiles derived empirically from large sequencing datasets.10 Downstream of raw sequencing read error simulation, other tools provide methods for simulating genotyping data (as if variant calling had already been performed, or genotyping microarrays were being used to generate SNP genotypes). The ped-sim software allows for simulating large, arbitrarily complex pedigrees with genotypes drawn from founders in a variant call format (VCF) file,11 allowing for realistic familial data to be generated by employing sex-specific genetic maps and realistic crossover interference models.12 Ped-sim allows for specifying the rate of genotyping error and the rate of an opposite homozygous error conditional on a genotyping error at a marker, but does not allow for arbitrary specification of substitution rates in biallelic SNPs (such as different rates for heterozygous to homozygous reference drop-out versus homozygous reference to heterozygous drop-in), and it applies these universal error rates to all samples in a multi-sample VCF.
While there are tools available to simulate errors in sequencing reads, and other tools such as ped-sim allow for broad specification of a global error rate in genotype calls, to our knowledge there is no existing tool that allows for individual specification of the type and degree of genotyping error for an arbitrary sample in a VCF (e.g., different rates for heterozygous to homozygous reference versus homozygous to heterozygous). The ability to precisely simulate the degree and kind of genotyping error (based on error probabilities discovered from prior upstream data analysis) is necessary to facilitate testing and evaluation of downstream applications, which may be differentially sensitive to different types of genotyping errors. Here we describe our efforts to develop, validate, and demonstrate vcferr, a user-friendly Python tool for probabilistically simulating arbitrarily specified genotyping error and missing data in VCF files. Source code, PyPI/Bioconda packages, and code to reproduce the results presented here are given in the Software availability section below.26,27
vcferr is written in Python and runs directly from a command-line interface. The tool is built using pysam,13 which internally manages reading and writing VCF file contents. When run on an input VCF, vcferr loads all variant records for a given sample. The tool checks each genotype and randomly draws from a list of possible genotypes (heterozygous, homozygous for the alternate allele, homozygous for the reference allele, missing), with each element weighted by the error rates specified by the user. The processing runs iteratively for every site in the input VCF, with the output streamed or optionally written to a new output VCF file. The following error and missingness models are available:
• rarr = Heterozygous drop out to homozygous reference: 0/1 or 1/0 to 0/0
• aara = Homozygous alternate to heterozygous: 1/1 to 0/1
• rrra = Homozygous reference to heterozygous drop in: 0/0 to 0/1
• raaa = Heterozygous to homozygous alternate: 0/1 or 1/0 to 1/1
• aarr = Homozygous alternate to homozygous reference: 1/1 to 0/0
• rraa = Homozygous reference to homozygous alternate: 0/0 to 1/1
• ramm = Heterozygous to missing: 0/1 or 1/0 to ./.
• rrmm = Homozygous reference to missing: 0/0 to ./.
• aamm = Homozygous alternate to missing: 1/1 to ./.
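The weighted draw described above can be sketched in a few lines of Python. This is a simplified illustration only: the dictionary keys, names, and probabilities below are hypothetical and do not reflect vcferr's internal representation.

```python
import random

# Illustrative error weights keyed by true genotype; values correspond to
# the error modes listed above (e.g., 0.05 for rarr, 0.01 for raaa).
ERROR_WEIGHTS = {
    ("0", "1"): {("0", "0"): 0.05, ("1", "1"): 0.01, (".", "."): 0.00},  # rarr, raaa, ramm
    ("0", "0"): {("0", "1"): 0.01, ("1", "1"): 0.00, (".", "."): 0.00},  # rrra, rraa, rrmm
    ("1", "1"): {("0", "1"): 0.00, ("0", "0"): 0.00, (".", "."): 0.00},  # aara, aarr, aamm
}

def perturb(gt, rng=random):
    """Return the genotype unchanged or an erroneous one, drawn with the
    user-specified weights; unspecified weights default to 0."""
    key = tuple(sorted(gt))                # 1/0 is treated the same as 0/1
    errors = ERROR_WEIGHTS.get(key, {})
    choices = [key] + list(errors)
    weights = [1.0 - sum(errors.values())] + list(errors.values())
    return rng.choices(choices, weights=weights, k=1)[0]
```

Over many sites, the proportion of each simulated error converges to its specified weight, which is the property exploited in the validation below.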
We validated vcferr by reviewing concordance metrics between VCF files before and after error simulation. To perform this validation we used data from the 1000 Genomes Project14 (see Underlying data) and the companion error assessment tool described below. The code for accessing the 1000 Genomes Project data and for reproducing the analyses presented in this paper is available from Extended data.25 First, we retrieved all genotype calls for chromosome 22 from the 1000 Genomes file transfer protocol (FTP) site. Next, we simulated a set of error rates for one sample. We observed that the proportion of mismatches reported by our bcftools-based error assessment procedure aligned as expected with each of the specified error probabilities. We also confirmed that the simulation of error in one sample did not trigger any mismatches for other samples in the VCF. The chromosome 22 VCF includes 1,103,547 positions and 2,504 samples, and genotype error simulation took approximately five minutes on a single CPU (central processing unit). The code used to fully reproduce this validation analysis is described in the Extended data section below.25
As part of this analysis we developed a companion tool (called nrc) to measure all types of error that can be observed with biallelic SNPs between a reference and a query sample. This tool also computes the nonreference concordance (NRC), which is defined as:

NRC = (mra + maa) / (err + era + eaa + mra + maa)

where err, era, and eaa are defined as the counts of mismatches at the homozygous reference, heterozygous, and homozygous alternate genotypes, and mra and maa are the counts of matches at the heterozygous and homozygous alternate genotypes (the number of homozygous reference matches, mrr, is omitted from this analysis). We implemented this procedure as a Docker container, which first runs bcftools stats13 and then post-processes the results using R. The Dockerfile needed to build the nrc image is available in the Software availability section below,27 and the Docker image can be pulled directly from the GitHub container registry.
vcferr is packaged in Python and can be installed from PyPI or Bioconda. Once installed, the tool can be used from the command line without explicitly calling the Python interpreter. vcferr requires a path to an input VCF, from which error will be simulated, as the first positional argument. The tool also requires --sample, specifying the ID of the sample to be simulated as it appears in the VCF header. All error flags are optional, as is --output_vcf, which if used defines the path to which the output VCF should be written. If no value is passed to a given error or missingness option, the probability defaults to 0. The following is a minimal example of usage, which results in 5% of the heterozygous genotypes in “sampleX” being erroneously called as homozygous reference and 1% of that same individual’s heterozygous genotypes being erroneously called as homozygous alternate, with output written to a new bgzipped VCF file:
# Install via pip or conda
pip install vcferr
conda install -c bioconda vcferr
# Run vcferr
vcferr my.vcf.gz --sample='sampleX' --p_rarr=0.05 --p_raaa=0.01 --output_vcf=sampleX.err.vcf.gz
The companion tool nrc for assessing individual error rate probabilities and overall non-reference concordance is available as a Docker image from the GitHub container registry. Once installed, the container can be run through the command line with no dependencies other than Docker. The container expects two VCF files and a sample ID which exists in both VCFs to compare. Optionally, a set of sites at which to compare can be specified (default is to compare all sites). This is useful to restrict analyses to a subset of sites (e.g., those that have at least some minimum minor allele frequency from GnomAD, given that filtered site VCF). Example usage:
# Pull the image from the github container registry and tag:
docker pull ghcr.io/signaturescience/nrc
docker tag ghcr.io/signaturescience/nrc nrc
# Run the nrc tool on sampleX in two vcfs in the current working directory,
# restricting analyses to sites in targetsites.vcf.gz in the current directory.
docker run --rm -v $(pwd):$(pwd) -w $(pwd) nrc \
    -s sampleX \
    -T targetsites.vcf.gz \
    original.vcf.gz vcferr.vcf.gz
To demonstrate how vcferr could be used to address a research question, we conducted an analysis of the impact of genotyping error on pairwise kinship estimation. Different error types may not be equally consequential; for instance, a heterozygous drop-out to homozygous reference may be much better tolerated than a switch from homozygous reference to homozygous alternate. We first generated simulated pedigree data using ped-sim12 with founders drawn from the British in England and Scotland (GBR) population in the 1000 Genomes Project data. The simulated data were masked to a subset of 590,241 sites, including content from the Illumina Global Screening Array. With all data originally simulated to include no error, we next identified an individual into which we would inject genotyping error. We used vcferr to iterate through each error mode from 0 to 20% error in 1% steps, holding all other error modes constant at 0.
# Loop through 0, .01, .02, …, .19, .20 for each error mode,
# keeping all other error modes at zero
vcferr pedigree.vcf.gz \
    --sample='sample_to_mutate' \
    --p_rarr=0.05 \
    --p_aara=0 \
    --p_rrra=0 \
    --p_raaa=0 \
    --p_aarr=0 \
    --p_rraa=0 \
    --output_vcf=pedigree.mutated.prarr.05.vcf.gz
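The full sweep can be scripted around the command line. A sketch in Python follows; the function name, file names, and sample ID are placeholders, not part of vcferr itself:

```python
def sweep_commands(vcf, sample, mode="rarr", steps=21):
    """Build one vcferr invocation per error rate, 0 to 0.20 in 0.01 steps,
    leaving all other error modes at their default of 0."""
    cmds = []
    for i in range(steps):
        p = i / 100
        cmds.append(["vcferr", vcf,
                     f"--sample={sample}",
                     f"--p_{mode}={p}",
                     f"--output_vcf={vcf}.{mode}.{p:.2f}.vcf.gz"])
    return cmds

# Each command list can then be dispatched with subprocess.run(cmd, check=True).
```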
Next, we obtained KING robust15 kinship coefficients from akt16 for all pairwise comparisons that included the error-simulated individual in the resulting simulation VCFs. We then used the kinship coefficients to infer relatedness out to the third-degree and compared accuracy of degree inference to truth from the original simulated pedigree. The results of this analysis are summarized in Figure 1. Code needed to reproduce this entire analysis, implemented as a Snakemake workflow,17 is available in the Zenodo archive described in the Extended data section below.25
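Mapping kinship coefficients to relatedness degrees is typically done with the powers-of-two cutoffs from the KING paper.15 A sketch of that step is below; we assume these standard thresholds here for illustration, and the archived Snakemake workflow contains the exact procedure used:

```python
def infer_degree(kinship):
    """Map a KING robust kinship coefficient to a relatedness degree using
    the standard powers-of-two cutoffs (Manichaikul et al., 2010)."""
    if kinship > 0.354:
        return 0          # duplicate / monozygotic twin
    if kinship > 0.177:
        return 1          # first-degree (parent-offspring, full sibling)
    if kinship > 0.0884:
        return 2          # second-degree
    if kinship > 0.0442:
        return 3          # third-degree
    return None           # unrelated or more distant than third-degree
```

Because simulated genotyping error shifts the estimated kinship coefficient, pairs can cross these thresholds and be assigned the wrong degree, which is what the accuracy comparison measures.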
In short, certain kinds of error dramatically decrease the accuracy of relatedness inference with the KING robust estimator. Note that the sample for which errors were simulated included 430,056 homozygous reference sites, 105,413 heterozygous sites, and 54,772 homozygous alternate sites, resulting in varying absolute counts of errors being simulated at each 0 to 20% step (as expected in any sample with genotypes from microarray or sequencing). By leveraging vcferr to inject a range of genotyping error scenarios, we can identify the scenarios that are most likely to impact kinship analysis.
Genome-wide association studies (GWAS) are a widely used approach in genetic epidemiology for understanding the genetic underpinnings of complex human disease.18 Formalin-fixed, paraffin-embedded (FFPE) tissue fixation is a standard means of preserving tissues from clinical samples, and can result in genotyping error19 that varies by type, with heterozygous calls being more commonly discordant.20 Power and sample size requirements for GWAS are impacted by genotyping error.21 If error probabilities for particular sample types such as FFPE tissue samples are known, the impact of different types of genotyping error on GWAS power or sample size requirements can be estimated in simulation. Any of several available tools can be used to simulate GWAS data with known sample size, effect size, SNP density, etc.22,23 Following data simulation, vcferr can be used to inject error into samples of interest, precisely controlling the degree and kind of error in each sample.
# Loop through id_1, id_2, id_n for sample_to_mutate
vcferr gwas.vcf.gz \
    --sample='sample_to_mutate' \
    --p_rarr=0.05 \
    --p_aara=0.01 \
    --p_rrra=0.01 \
    --p_raaa=0.05 \
    --p_aarr=0.0001 \
    --p_rraa=0.0001 \
    --output_vcf=gwas.mutated.vcf.gz
Similarly, a polygenic risk score (PRS) captures the additive effect of hundreds, or perhaps thousands, of SNPs, and is computed on genome-wide genotyping data using a prediction model based on summary statistics from external GWAS studies.24 A PRS is calculated by summing the risk alleles an individual carries, with each risk allele weighted by its effect size as estimated by a large GWAS on that phenotype. The PRS is then applied to target genotype data to assess overall genetic liability to developing the trait or disease. If the target genotypes result from a process that introduces error at different levels for different zygosities (e.g., low-pass whole-genome sequencing), the performance of a PRS will be affected differently depending on the type and degree of error introduced. If this error profile is known from characterization against high-quality samples, it can be used to inject error at the specified rates into samples simulated for assessing the predictive ability of a PRS in a new target genotype collection, using the same approach described above for GWAS.
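The additive PRS calculation described above reduces to a weighted sum. A simplified sketch follows; real implementations additionally handle strand flips, allele matching, and missing genotypes, and the names here are illustrative:

```python
def prs(dosages, effect_sizes):
    """Additive PRS: sum over SNPs of risk-allele dosage (0, 1, or 2)
    weighted by the GWAS-estimated effect size. Both dicts are keyed by
    SNP ID; SNPs without a published effect size are skipped."""
    return sum(effect_sizes[snp] * dose
               for snp, dose in dosages.items() if snp in effect_sizes)
```

Injecting zygosity-specific error with vcferr changes the dosage inputs to this sum, which is how the impact of a given error profile on PRS performance can be quantified.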
vcferr provides a flexible, user-friendly interface for simulating genotyping error in VCF files. While tools exist to simulate entire VCFs with error (e.g., ped-sim12), to our knowledge there are no other tools at the time of writing that can simulate error in arbitrary individuals in a VCF at varying types and levels while leaving all other samples as-is. We anticipate that vcferr will be useful for researchers in multiple areas, including forensic genomics and medical genetics.
The 1000 Genomes Project data14 used in this study is freely accessible to anyone without registration via The International Genome Sample Resource website: https://www.internationalgenome.org/category/ftp/.
• Archived analysis code at time of publication: https://doi.org/10.5281/zenodo.659913425
• License: Creative Commons Attribution 4.0 International license (CC-BY 4.0)
The vcferr python tool is the primary software tool described and demonstrated in this paper.
• Software available from: PyPI (https://pypi.org/project/vcferr/) and Bioconda (https://anaconda.org/bioconda/vcferr)
• Source code available from: https://github.com/signaturescience/vcferr
• Archived source code at time of publication: https://doi.org/10.5281/zenodo.661179626
• License: MIT
The nrc companion tool was created for assessing error produced by vcferr.
• Software (Docker image) available from: https://ghcr.io/signaturescience/nrc
• Source code available from: https://github.com/signaturescience/nrc
• Archived source code at time of publication: https://doi.org/10.5281/zenodo.661178227
• License: MIT
The authors acknowledge Dr. Mike Coble for helpful discussions about the software and its use cases.