Software Tool Article

snpQT: flexible, reproducible, and comprehensive quality control and imputation of genomic data

[version 1; peer review: 3 approved with reservations]
PUBLISHED 14 Jul 2021

Abstract

Quality control of genomic data is an essential but complicated multi-step procedure, often requiring separate installation and expert familiarity with a combination of different bioinformatics tools. Dependency hell and reproducibility are recurrent challenges. Existing semi-automated or automated solutions lack comprehensive quality checks, flexible workflow architecture, and user control. To address these challenges, we have developed snpQT: a scalable, stand-alone software pipeline using nextflow and BioContainers, for comprehensive, reproducible and interactive quality control of human genomic data. snpQT offers some 36 discrete quality filters or correction steps in a complete standardised pipeline, producing graphical reports to demonstrate the state of data before and after each quality control procedure. This includes human genome build conversion, population stratification against data from the 1,000 Genomes Project, automated population outlier removal, and built-in imputation with its own pre- and post-imputation quality controls. Common input formats are used, and a synthetic dataset and comprehensive online tutorial are provided for testing, educational purposes, and demonstration. The snpQT pipeline is designed to run with minimal user input and coding experience; quality control steps are implemented with default thresholds which can be modified by the user, and workflows can be flexibly combined in custom combinations. snpQT is open source and freely available at https://github.com/nebfield/snpQT. A comprehensive online tutorial and installation guide, covering the pipeline through to GWAS, is provided at https://snpqt.readthedocs.io/en/latest/, introducing snpQT using a synthetic demonstration dataset and a real-world Amyotrophic Lateral Sclerosis SNP-array dataset.

Keywords

GWAS, Quality Control, GWAS pipeline, Nextflow, Imputation, SNPs, Genomic Variants, BioContainers, QC, Population Stratification, Anaconda

Introduction

Genome-Wide Association Studies (GWAS) seek to identify genetic variants that have a statistically significant association with a trait, such as a disease or other phenotype of interest. GWAS has been widely employed in a large variety of applications, including network-based and machine learning approaches1. A rapid explosion in the quantity of genomic data has created the need for systematic and standardised quality control (QC). Assuring high quality of genomic data is necessarily a complex multi-step procedure with multiple challenges, but it is essential in order to ensure reproducible and reliable results1–4. Although there are well-established steps and good practices5,6, there is no standardised and universally followed workflow, which impacts on the reproducibility and comparability of results2,4.

Existing approaches, including semi-automated tools7, can involve a time-consuming "trial and error" approach, requiring the analyst to check the distributions of parameters in plots produced over many rounds of adjustments, and to manually enter commands in a long list of QC steps one-by-one, or in a series of shell scripts. The analyst may encounter incompatibility problems and installation difficulties, as well as spending valuable time familiarising themselves with a number of different tools that sometimes lack detailed documentation. Software architecture tools such as nextflow and BioContainers can address these issues and have been proposed as automated solutions8; however, existing automated solutions remain restricted: their QC analysis is limited and relatively rigid, lacking steps such as imputation, offering a limited variety of threshold choices and plot outputs, and requiring users to have extensive knowledge of the software in order to tailor their analysis. Here we present snpQT (shown in Figure 1): a standardised, flexible, automatic pipeline tool that provides comprehensive quality control, with imputation and association analysis, including ready-to-publish graphs and plots for data interpretation and validation at every QC step.


Figure 1. Outline of the snpQT architecture, which includes nine core workflows (A–I) that are implemented using nextflow.

To provide a reproducible and scalable pipeline, snpQT automatically loads software dependencies using Anaconda, Docker, or environment modules. Each workflow expects specific inputs either from the user or from the outputs generated by other workflows. Main tools and key processes (modules) are highlighted in green. Examples of different task combinations are represented in the upper-right corner, showing the flexibility and interactivity among the implemented workflows. VCF: Variant Call Format; QC: Quality Control; PCA: Principal Component Analysis; GWAS: Genome-Wide Association Studies.

Methods

Implementation

snpQT was developed as a set of nine core workflow components implemented with the nextflow workflow management system9. Each workflow component consists of independent containerised modules, using BioContainers curated by the bioinformatics community wherever possible10. Combining independent containerised modules into workflows, and enabling multiple workflow combinations through the nextflow architecture, makes snpQT a reproducible and uniquely versatile tool for the analysis of human variant genomic data. Containers improve end-user experience and enable reproducible research by automatically provisioning bioinformatics software as required and improving numerical stability11. Running individual modules in independent containers also solves a common problem when installing potentially incompatible software packages, known as "dependency hell"12. In addition, nextflow enables caching at continuous checkpoints, so users can alter thresholds without needing to rerun earlier parts of the analysis. Briefly, if a module has the same input and parameters as a previous pipeline run has already processed, then the cached work is passed to the next module in the workflow instead of being recomputed. This means that if a user runs multiple jobs and changes parameters at a later stage in the overall pipeline, then earlier unchanged stages are skipped, saving time.
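For example, re-running with a modified threshold at a later stage reuses the cached earlier stages (a sketch using the toy dataset and parameters described in the Use Case section below; the threshold value is illustrative):

# First run: Sample and Variant QC with default thresholds
nextflow run main.nf -profile conda \
  --bed toy.bed --bim toy.bim --fam toy.fam \
  --qc --sexcheck false --results results_toy/

# Re-run with a stricter MAF threshold: cached Sample QC results are reused
# and only the affected Variant QC steps onwards are recomputed
nextflow run main.nf -profile conda -resume \
  --bed toy.bed --bim toy.bim --fam toy.fam \
  --qc --maf 0.01 --sexcheck false --results results_toy/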

As each genomic study is unique, a tailored and flexible pipeline with informative representations of intermediate quality control data is required. snpQT is designed to offer multiple combinations of workflows as well as modifiable threshold parameters for multiple steps (as shown in Figure 1). Workflow A runs only once, performing a local database set up, downloading and preparing reference files13,14 and setting up specific versions of tools using Anaconda, Docker or environment modules. Workflow B has been created for the user to remap their genomic dataset from build 38 to 37 and vice versa. Workflow C performs Sample QC, including checks for missing call rate, sex discrepancies, heterozygosity, cryptic relatedness, and missing phenotypes. Workflow D performs Population Stratification: after an internal QC of the reference genome and the user’s dataset, the two datasets are merged and prepared for the automatic removal of ethnic outliers using EIGENSOFT15. Principal Component Analyses are carried out before and after the outlier removal. Workflow E performs the main Variant QC, checking missing call rate, Hardy-Weinberg equilibrium deviation, minor allele frequency, and missingness in case/control status, and generates covariates for GWAS based on a user-modifiable number of Principal Components (or users may provide a covariates file). Workflow F is for pre-imputation quality control, preparing the dataset for imputation, while Workflow G performs local phasing and imputation using shapeit416 and impute517. Workflow H performs post-imputation QC, where poorly imputed variants are removed, different categories of duplicated variant IDs are handled, and the phenotypes of the dataset are updated. The workflows’ structure also allows users to upload their data to an external imputation server, or use a different reference panel. Workflow I performs GWAS with and without adjustment of covariates (if covariates are not provided by the user, snpQT uses the first five Principal Components generated from population stratification in Workflow D), outputting summary statistics, along with a Manhattan plot and a Q-Q plot.

As it can be challenging to choose the correct threshold for a metric, snpQT provides a "Make report" module in each of the main Workflows C, D, E, and I, which provides interactive HTML reports summarising all the plots from both before and after the chosen thresholds have been applied, enabling the user to easily inspect and check whether the chosen thresholds are appropriate. Detailed summary logs and graphs are also provided throughout, depicting the total number of samples and variants at each step, both for users who need easy and fast inspection of the processes and for users who want a more in-depth report, with pointers to the locations of intermediate files and logs.

Operation

snpQT is implemented in nextflow, R and Unix command line utilities. The minimum software requirements to run snpQT are Java 8, nextflow v20.10.0, and a POSIX-compatible operating system (tested on CentOS 8). The hardware requirements scale with the input data and workflows: typically, quality control checks require less than 16GB of RAM and 4 cores on large datasets of 40,000 individuals. However, imputation requires significant computing power, up to 50GB of RAM per chromosome per core. As well as those already listed, the following tools are used: picard (https://broadinstitute.github.io/picard/), PLINK18, PLINK2.019, samtools20, and snpflip (https://github.com/biocore-ntnu/snpflip).
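The minimum software requirements can be checked from the command line before installation (a minimal sketch; nextflow itself will also refuse to start without a suitable Java installation):

java -version        # should report Java 8 or later
nextflow -version    # should report v20.10.0 or later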

The latest release of the 1,000 Genomes Project data13 is used as a reference panel in both VCF and processed PLINK2 formats14. Part of the population stratification and variant QC implementation was inspired by the work of 6. Optional software requirements include Docker, Anaconda, and Environment Modules, which provide a simple method to install and run the underlying collection of bioinformatics software described above; example invocations of each profile follow the list below.

  • Anaconda is suitable for users who are not interested in performing local imputation and who do not have root access in their machines. Users can still run pre-imputation and post-imputation QC, as well as all the remaining QC-related workflows of snpQT.

  • Docker requires root access, which enables the installation of impute5, which is used for imputation.

  • Environment modules are useful to run all stages of the pipeline in HPC environments, where root access is not available, but require some user configuration because installed packages and package names are custom to each HPC environment.
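As a sketch, the three options map onto nextflow profiles as follows (shown here with the core database set up command described in the next section):

nextflow run main.nf -profile conda --download_db core    # Anaconda
nextflow run main.nf -profile docker --download_db core   # Docker (needed for local imputation)
nextflow run main.nf -profile modules --download_db core  # HPC environment modules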

Full documentation of snpQT, including an installation guide, a Quickstart explanation of workflow combinations and commands, a complete description of workflows, and an in-depth tutorial, is provided at https://snpqt.readthedocs.io/en/latest/. The following Use Case section gives examples of input and output with explanatory context, and explains all of the key parameters needed to make use of snpQT.

Use case

This section provides a guide through the snpQT Quality Control pipeline, explaining the steps and demonstrating the application of the tool, using a synthetic dataset which is freely available with the tool. Plots shown are based on this synthetic dataset, which has some artificial structure but is adequate for demonstration purposes. The online tutorial includes plots with natural distributions, derived from a real-world Amyotrophic Lateral Sclerosis dataset of 2,000 samples (1,000 cases and 1,000 controls) taken from a restricted-access dbGaP project21.

Installation

Before downloading and running snpQT, the user should have installed nextflow (v20.10.0) and, depending on their needs, Docker or Anaconda. To begin installation, the repository can be cloned and setup initiated by running the following commands:

git clone https://github.com/nebfield/snpQT.git
cd snpQT
nextflow run main.nf

Before starting to use any of the implemented workflows, it is necessary to set up a local database of reference and auxiliary files that snpQT requires to run. Because of the large volume of reference data in the imputation workflow, we have designed two types of database. The first is the core reference database, which is sufficient for build conversion, sample and variant quality control, population stratification, pre-imputation and post-imputation. The second is required only for local imputation, and involves downloading the latest release of the 1,000 Genomes Project data. In either case, it is necessary to first set up the core database:

nextflow run main.nf -profile [conda/docker/modules] \
  --download_db core

The above command will download the core reference files, prepare them and store them in a db/ folder in the snpQT directory. On the test system this took around 1 hour to complete, but this will vary depending on the system and network. The core dataset requires ~43GB of initial storage for intermediate files, which are stored in the work/ directory. These can be removed after set up using nextflow run main.nf --download_db core && rm -r work, leaving only 19.7GB of reference files that are stored in the database directory db/.

Alternatively, an already processed .tar.gz file of the required reference data (17.3GB; 19.7GB when unzipped) can be downloaded directly from our servers, using the following lines of code:

mkdir db/
cd db
wget https://sys-myo.com/snpQT/core.tar.gz
tar -xvf core.tar.gz

If the user is interested in local imputation then the following command should also be run (in addition to the previous commands for the core database set up):

nextflow run main.nf -profile [conda/docker/modules] \
  --download_db impute

This command will download the imputation reference files and store them in a snpQT/db/impute directory. Downloading and processing the imputation data may take one to two hours or more, depending on the system and network. These reference files require an additional 30GB of storage. Alternatively, an already processed .tar.gz imputation reference file (13GB; 15GB when unzipped) can be downloaded directly from our servers in around 25 minutes, using the following lines of code:

mkdir db/
cd db
wget https://sys-myo.com/snpQT/impute.tar.gz
tar -xvf impute.tar.gz

The snpQT workflows generate many intermediate files that are not shown directly to the user, and are usually not needed by the user. Nextflow stores these in the snpQT/work/ directory by default. Due to the large size to which this directory can rapidly grow, it is recommended that the work folder be deleted after setting up the reference databases, using the command shown above. The same command can also be used to regularly clear the work/ folder as needed, such as after running multiple analyses using snpQT. However, while these intermediate files are retained, snpQT will remember previous work that it has done, such that it may automatically avoid needless repetition of previously completed work if asked to run a different stage of the pipeline, or to run again with tweaked parameters, on the same input data. When the work/ folder is deleted, all work for the analysis will need to be redone.
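For example, assuming the commands are run from the snpQT directory (note that once the cache is deleted, subsequent runs recompute from scratch):

# Clear cached intermediate files after database set up,
# or between analyses once results have been collected
rm -r work/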

Datasets

After the download and set up of the database, analysis can begin. A synthetic demonstration dataset is available with snpQT, which can be used to gain familiarity with the workflows and modules, while ensuring reproducible results identical to those shown in this section. The synthetic dataset is located within the data/ folder, and consists of a .vcf.gz file and three binary plink files (.bed, .bim and .fam). The dataset contains 6,517 randomised genotypes of chromosome 1, derived from 100 female samples with balanced binary phenotypes (51 cases vs 49 controls). The chromosome positions, alleles and SNP IDs have been updated according to human 1,000 Genomes Project data (b37).

Human genome build conversion

The Human Genome Build Conversion workflow converts genomic files aligned to build 38 to build 37 (default mode) and vice versa, using Picard’s LiftoverVcf utility. snpQT assumes that input genomic data are aligned to build 37, as some of the workflows are designed to accept this input. Although GRCh37 is not the most recent human reference genome build, it is the most frequently used among current public reference genomic datasets (e.g. 1,000 Genomes data, Haplotype Reference Consortium panel), online imputation servers (e.g. Sanger Imputation Server) and available SNP-array datasets, and for this reason snpQT has been designed to support GRCh37 (hg19) for most workflows. Hence, this workflow can be helpful for users with data aligned to build 38 to convert them to build 37, in order to run QC and population stratification. The workflow may also be helpful when a user has finished their main QC and wishes to upload their data to an external imputation server that uses a reference panel aligned to b38 (e.g. TOPMed) or b37 (e.g. Haplotype Reference Consortium panel), or to run a local imputation using a reference panel which is aligned to another build.

The genomic build of the synthetic dataset (which is aligned to b37) can be converted to b38 by running the following code, with input files and options explained below:

nextflow run main.nf -profile conda -resume \
  --vcf data/toy.vcf \
  --fam data/toy.fam \
  --convert_build \
  --input_build 37 \
  --output_build 38 \
  --results convert_build_toy/

  • Input files:

    • --vcf: This workflow requires a valid VCF file of human genomic data.

    • --fam: This workflow also requires an accompanying plink .fam file which should contain the same samples as the VCF file.

  • snpQT options:

    • --convert_build: runs the build conversion workflow

    • --input_build: defines the build of the input data [37/38 (default)]

    • --output_build: defines the build of the output data [37 (default)/38]

    • --mem [16 (default)]: assigns the memory size that the LiftoverVcf utility can use

    • --results: specifies the directory where the output files are stored. To retain results from separate analyses, the name of the results folder should be changed between runs.

  • Nextflow options:

    • -resume: runs multiple jobs using cached files (skipping processes which are not affected by new changes). When running a different stage of the pipeline on the same input data, this causes snpQT to avoid needless repetition of work already done.

    • -profile conda can be replaced with -profile [docker/modules] depending on your installation and needs (e.g. whether local imputation is needed or not, HPC requirements).

When this workflow runs successfully, a new folder will be created named convert_build_toy/, containing a files/ subfolder with three binary plink files (the .fam file contains updated phenotype information) aligned to b38, and a .vcf.gz file for users who prefer this format for other purposes.
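The expected layout is sketched below (exact file names depend on the input data):

convert_build_toy/
└── files/
    ├── <plink .bed/.bim/.fam>   # aligned to b38; the .fam contains updated phenotypes
    └── <converted .vcf.gz>      # for users who prefer the VCF format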

Main quality control

snpQT’s main QC workflow is divided into two distinct nextflow modules: Sample and Variant QC. Sample and Variant QC can be run using the parameter --qc. The required input files are binary plink files which can be imported using the parameters --bed, --bim and --fam, for .bed, .bim and .fam plink files, respectively. The main checks of Sample QC are listed below, with accompanying parameters and explanation; an example command combining several of these parameters follows the list:

  • Missing variant call rate check: Remove very poor quality SNPs based on call rate (snpQT default threshold is 0.1). These SNPs will be removed anyway at the variant QC stage, and applying the filter here avoids unnecessary removal of samples that may otherwise be of good quality.

  • Missing sample call rate check: Remove samples with lower than 98% call rate using the --mind parameter (snpQT default threshold --mind 0.02). The call rate distribution across all samples is visualised with histograms and scatterplots, before and after the threshold is applied.

  • Check for sex discrepancies: Remove problematic samples for which (1) pedigree sex does not agree with the predicted sex based on sex chromosome homozygosity or (2) there is no sex phenotype, using plink (default mode). This step can be skipped by setting the --sexcheck false parameter.

  • Removal of non-autosomal SNPs: The default mode of snpQT is to keep the sex chromosomes. If the user wishes to remove the sex chromosomes, the --keep_sex_chroms false parameter can be set.

  • Heterozygosity check: Identify and remove heterozygosity outliers (samples that deviate more than 3 standard deviations from the mean heterozygosity). The distribution of the samples’ heterozygosity is visualised by a histogram and a scatterplot. Extreme heterozygosity implies inbreeding and/or DNA contamination. This step can be skipped using the --heterozygosity false parameter.

  • Check for cryptic relatedness and duplicates: Check for cryptic pairs of relatives or duplicated samples using plink2’s relationship-based pruning. The pruning threshold can be changed using the --king_cutoff [0.125 (default)] parameter.

  • Removal of samples with a missing phenotype: Remove samples with missing phenotypes. Missing phenotype here refers to phenotype status (i.e. case/control status, the last column in the plink .fam file). The default option in snpQT is to skip this step, but it can be performed using the parameter --rm_missing_pheno true.
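A sketch combining several of these Sample QC parameters in a single run (the threshold values here are illustrative, not recommendations):

nextflow run main.nf -profile conda -resume \
  --bed toy.bed \
  --bim toy.bim \
  --fam toy.fam \
  --qc \
  --mind 0.05 \
  --king_cutoff 0.0884 \
  --rm_missing_pheno true \
  --sexcheck false \
  --results results_sampleqc_custom/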

The second part of the main QC is Variant QC, which is again implemented using the --qc parameter. It is considered good practice to first filter low quality samples, in order to reduce the risk of removing a potentially high-risk variant during Variant QC. For this reason, the population stratification workflow (if chosen to be run by the user, as explained in the next subsection), which is essentially a Sample QC step, is designed to run between Sample QC and Variant QC, as seen in Figure 1. The Variant QC module contains the following steps, with an example command following the list:

  • Missing variant call rate check: Remove poor quality SNPs using the parameter --variant_geno (snpQT default threshold is --variant_geno 0.02).

  • Hardy-Weinberg equilibrium (HWE) deviation check: Remove SNPs that significantly deviate from HWE (snpQT default threshold is p-value <10e-7), which may indicate a genotyping error, and visualise the distribution of SNPs with extreme deviation. This threshold can be changed using the parameter --hwe.

  • Minor Allele Frequency (MAF) check: Remove SNPs with low MAF (snpQT default threshold is 0.05) and visualise the MAF distribution. This threshold can be changed using the --maf parameter in snpQT. Rare SNPs (having a very low MAF) are prone to genotyping error and false-positive associations, and are usually excluded from further analysis.

  • Missingness in case/control status check: Remove SNPs with a statistically significant association between missingness (low call rate) and case/control status (snpQT default threshold is p-value <10e-7). The threshold can be changed using the parameter --missingness. This check cannot be performed for quantitative data; if the user’s data are not binary, the --linear parameter can be used to skip this step.

  • Generate covariates using the first X Principal Components of each sample: Perform Principal Component Analysis and visualise 2D and interactive 3D PCA plots, annotating the samples by phenotype status. If the GWAS workflow is called (using the parameter --gwas), the first 3 Principal Components (PCs) are used by default to account for inner population structure. The number of PCs can be adjusted using the --pca_covars parameter, which takes as input a number from 1 to 20, counting from the first Principal Component of the PCA. Prior to the PCA, snpQT keeps only independent markers, performing variant pruning using PLINK. This behaviour can be controlled using the parameter --indep_pairwise ["50 5 0.2" (default)].
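A sketch combining several of these Variant QC parameters (threshold values are illustrative, and the numeric format passed to --hwe is assumed to match the p-value notation above):

nextflow run main.nf -profile conda -resume \
  --bed toy.bed \
  --bim toy.bim \
  --fam toy.fam \
  --qc \
  --variant_geno 0.05 \
  --hwe 1e-7 \
  --maf 0.01 \
  --pca_covars 5 \
  --sexcheck false \
  --results results_variantqc_custom/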

The synthetic toy dataset does not contain sex chromosomes, so to avoid plink producing an error it is important to add --sexcheck false to skip the step which checks for sex discrepancies. Main QC on the toy dataset can be performed by running the following command:

nextflow run main.nf -profile conda -resume \
  --bed toy.bed \
  --bim toy.bim \
  --fam toy.fam \
  --qc \
  --results results_toy/ \
  --sexcheck false

On completion, the results_toy/ directory will now contain as many folders as the number of workflows that were run. Based on the example given, a results_toy/qc/ folder should be present, containing the following sub-folders and files (this structure will be similar for most other workflows):

  • A bfiles/ folder including the binary plink files of the last step of the corresponding module

  • A figures/ folder including all generated plots for the steps that have been run within this particular workflow, as well as log plots summarising the number of samples and variants in each step of the workflow

  • A logs/ folder including two .txt files (sample_qc_log.txt and variant_qc_log.txt) summarising details about the numbers of samples, variants, and phenotypes for each step of Sample and Variant QC, as well as the working directory where the intermediate files for each process are stored, so that it is easier for the user to inspect the results

  • Two .html reports for Sample and Variant QC summarising all the "before-and-after the threshold" plots generated in each step, as well as a plot demonstrating the number of samples and variants in every step.

In Figure 2 and Figure 3, we show some examples of the output plots from the Sample and Variant QC workflows for the toy dataset. Figure 2 illustrates a sample call rate histogram and a scatterplot for the toy dataset, before and after the default threshold has been applied (indicated by a red line). Figure 3 shows one of the last processes of the Variant QC workflow, where Principal Component Analysis is performed on the clean dataset both for data exploration and for generation of covariates using the first X Principal Components, which can then be used in the GWAS workflow, to account for a potential inner population sub-structure.


Figure 2.

Sample call rate for the synthetic toy dataset shown as (a) a histogram and (b) a scatterplot, before and after applying the default threshold of 0.02 (red line). This synthetic randomised dataset was created for demonstration purposes, thus the sample call rate distribution may not closely resemble a real-world dataset.


Figure 3. 3D Principal Components Analysis (PCA) of the synthetic dataset.

The samples are annotated based on their phenotype (e.g. case/control status). The 3D PCA plot is available in an interactive environment incorporated into the .html reports. PCA plots are also provided in a 2D format for the first three Principal Components.

Population stratification

The aim of the Population Stratification workflow is to identify and then remove potential ethnic outliers, based on population structure, using the EBI’s latest release of a processed phased 1000 Genomes Project reference panel, aligned to human genome build 37. Population stratification is an essential step in QC analysis, since it minimises the possibility that differences in allele frequencies are caused by differences in sample ancestry rather than by the trait under study. The population stratification workflow requires the main QC workflow.

During this workflow, internal processing of both the 1,000 Genomes data and the user’s dataset is performed, and the two datasets are then merged, keeping only the SNPs shared by both. The internal processing consists of numerous QC steps, some of which can be tailored by the user, by passing the following parameters:

  • --indep_pairwise ["50 5 0.2" (default)]: Control PLINK’s variant pruning process

  • --variant_geno [0.02 (default), 0-1]: Remove poorly genotyped variants

When both datasets are prepared and merged, snpQT creates a racefile labelling the race of each sample. The user's samples are automatically labelled as "OWN". The race label for the 1,000 Genomes data is controlled by the --racefile [super (default), sub] parameter, using super-population labels (e.g. EUR, AFR, AMR) by default, or subpopulation labels. When the racefile and the merged dataset are ready, EIGENSOFT's smartpca software is run for automatic outlier removal. smartpca takes a set of parameters in the form of a file; snpQT provides the option to change this file according to the user's needs using the --parfile parfile.txt parameter. Lastly, the user can choose to infer eigenvectors based on a population subset list in smartpca using the parameter --racecode ["" (default), "EUR"/"AFR"/"SAS"].
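For example, to restrict smartpca's eigenvector inference to the European reference samples while keeping the default super-population labels (a sketch; the combination of values is illustrative):

nextflow run main.nf -profile conda -resume \
  --bed toy.bed \
  --bim toy.bim \
  --fam toy.fam \
  --qc \
  --pop_strat \
  --racecode "EUR" \
  --sexcheck false \
  --results results_popstrat_eur/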

To perform population stratification on the toy dataset the user can run the following command:

nextflow run main.nf -profile conda -resume \
  --bed toy.bed \
  --bim toy.bim \
  --fam toy.fam \
  --qc \
  --pop_strat \
  --results results_toy/ \
  --sexcheck false

After successful completion, a new sub-folder pop_strat/ will be created within the --results directory, along with the previous qc/ folder. Since the -resume parameter was used, the Sample QC processes have been cached, making the pipeline run faster. As mentioned above, Population Stratification runs in between Sample and Variant QC, which means that now --pop_strat is combined with --qc, the Variant QC input files have changed and the corresponding processes will run again with the updated input. Within the pop_strat/ folder, the following are included:

  • A bfiles/ folder including three binary plink files and a .log file coming from the last process of the --pop_strat workflow

  • A figures/ folder including six 2D plots for the first three Principal Components, two 3D plots in .rds format for an interactive user experience, for both before and after ethnic outlier removal using EIGENSOFT, and lastly, two log plots summarising the number of samples and variants in each step of the workflow

  • A logs/ folder including a pop_strat_log.txt file summarising details about the numbers of samples, variants, and phenotypes for each step of the workflow, as well as the working directory where the intermediate files for each process are stored, so that it is easier for the user to inspect the results

  • A popstrat_report.html report summarising all the aforementioned plots, as well as hosting an interactive environment for the 3D plots.

Figure 4 shows the Principal Component Analysis (PCA) plot for the synthetic toy dataset. The PCA topology is quite artificial here, as this is a synthetic dataset made in plink2, containing only a few thousand genotypes of chromosome 1, which are subsequently pruned, leaving a few hundred independent SNPs merged with 1000 Genomes Project data. A more natural example of PCA plots resulting from real-world data is shown in the online tutorial for the ALS dataset.


Figure 4. 3D Principal Components Analysis (PCA) of the synthetic dataset following combination with the 1000 Genomes Project data.

The samples are annotated based on their ancestry, except for the user’s data which are labelled as "OWN". Two 3D PCA plots are available in an interactive environment incorporated in the .html reports, representing before and after ethnic outlier removal using EIGENSOFT. PCA plots are also provided in a 2D format for the first three Principal Components. Since the synthetic toy dataset is artificial and contains only a few hundred randomised independent markers, it is located at a large distance from the 1000 Genomes data. For this reason, the before-and-after outlier removal PCA plots are identical, so only one is shown here.

GWAS

This workflow performs both logistic and linear regression for binary and quantitative phenotypic traits, respectively, using the parameter --gwas. snpQT performs a logistic regression by default, but it is also designed to run linear regressions using the --linear parameter. These analyses can be performed with and without adjusted covariates to account for fine-scale population structure. Covariates can be calculated at the end of the --qc workflow (preferably combined with the population stratification workflow) using the first X Principal Components of the generated PCA, via the --pca_covars [3 (default), 1-20] parameter; alternatively, covariates can be passed directly by the user as an argument with --covar_file [false (default), covar.txt]. The covar.txt file should follow the same format as a PLINK covariate file. The GWAS workflow requires the main QC workflow to run in advance. For a logistic regression analysis on the toy dataset, the user can run the following command (the example assumes that the user wishes to run population stratification as well, although this is not obligatory):

nextflow run main.nf -profile conda -resume \
  --bed toy.bed \
  --bim toy.bim \
  --fam toy.fam \
  --qc \
  --pop_strat \
  --gwas \
  --results results_toy/ \
  --sexcheck false

The command above causes a total of 39 separate snpQT processes to run, completing in ~2 minutes on the development system. When GWAS has finished running successfully, a new gwas/ sub-folder will be created within the results_toy/ directory, along with the previous pop_strat/ and qc/ folders mentioned above. Within the gwas/ folder the following are included:

  • A files/ folder including the plink2 Generalised Linear Regression results of GWAS analyses, both with and without adjusted covariates, accompanying log files and GWAS files with adjusted p-values for multiple-testing corrections.

  • A figures/ folder including Quantile-Quantile (Q-Q) plots and Manhattan plots for the GWAS results, both with and without covariates, and log plots illustrating the number of samples and variants at each step.

  • A logs/ folder including a gwas_log.txt file summarising details about the numbers of samples, variants, and phenotypes for each step of the corresponding workflow, as well as the working directory where the intermediate files for each process are stored, so that it is easier for the user to inspect the results.

  • An .html report summarising all the aforementioned plots.
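As a variation on the command above, a user-supplied covariates file can replace the generated Principal Components (a sketch; covar.txt is assumed to follow the PLINK covariate format, as noted above):

nextflow run main.nf -profile conda -resume \
  --bed toy.bed \
  --bim toy.bim \
  --fam toy.fam \
  --qc \
  --gwas \
  --covar_file covar.txt \
  --sexcheck false \
  --results results_toy_covar/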

Figure 5 and Figure 6 show Q-Q and Manhattan plots, with and without covariate adjustment for the synthetic toy dataset, respectively.


Figure 5.

Q-Q (Quantile-Quantile) plots for the synthetic dataset both (a) with and (b) without covariate adjustment, with their accompanying lambda values. This synthetic randomised dataset was created for demonstration purposes using the default thresholds, thus the Q-Q distributions may not closely resemble a real-world clean dataset.


Figure 6.

Manhattan plots for the synthetic dataset both (a) with and (b) without covariate adjustment. Each dot represents a variant located based on its genomic position (x-axis) and GWAS -log(p-value) (y-axis). The synthetic dataset at this stage contains 2,312 variants located in chromosome 1.

Pre-Imputation, imputation & post-imputation

The Pre-Imputation workflow prepares a genomic dataset for imputation, fixing issues such as flipping SNPs that are on the reverse strand, removing ambiguous and duplicated SNPs, and fixing the reference allele. If the user wishes to run the Pre-Imputation workflow independently (i.e. the user is not interested in local imputation) then it must be combined with the main QC workflow (and only optionally with the Population Stratification workflow). When Pre-Imputation QC is run independently, the --pre_impute argument is used, and it cannot be combined with the GWAS, Imputation and Post-Imputation workflows. To run Pre-Imputation QC on the synthetic dataset, the user can run the following command:

nextflow run main.nf -profile conda -resume \
  --bed toy.bed \
  --bim toy.bim \
  --fam toy.fam \
  --qc \
  --pre_impute \
  --results results_toy/ \
  --sexcheck false

The Pre-Imputation workflow creates a preImputation/files/ directory, which contains a compressed VCF file and its index, ready for imputation.

snpQT also offers an optional Imputation workflow, using the parameter --impute, in which the user can increase the number of markers in their genomic dataset, using EBI’s latest release of the phased 1,000 Genomes Project reference panel aligned to human genome build 37. When the --impute workflow is used, the Pre-Imputation workflow is called internally before phasing and local imputation, and the Post-Imputation workflow afterwards. As explained above, Pre-Imputation prepares the user’s dataset for phasing using shapeit4 and local imputation with impute5. When phasing and imputation per chromosome are finished, the Post-Imputation workflow takes the imputed chromosomes as input and filters out poorly imputed variants, based on INFO score and MAF. The user can alter these filters using the --info [0.7 (default), 0-1] and --impute_maf [0.01 (default), 0-1] parameters, respectively. The Post-Imputation workflow also annotates missing SNP IDs and handles different categories of duplicated SNPs.
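For example, stricter post-imputation filters can be applied during a local imputation run (a sketch; threshold values are illustrative):

nextflow run main.nf -profile docker -resume \
  --bed toy.bed \
  --bim toy.bim \
  --fam toy.fam \
  --qc \
  --impute \
  --info 0.9 \
  --impute_maf 0.05 \
  --sexcheck false \
  --results results_impute_strict/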

To run local imputation on the synthetic toy dataset the user can run the following command:

nextflow run main.nf -profile docker -resume \
  --bed toy.bed \
  --bim toy.bim \
  --fam toy.fam \
  --qc \
  --pop_strat \
  --impute \
  --gwas \
  --results results_toy/ \
  --sexcheck false

When imputation is complete, separate preImputation/ and post_imputation/ folders are created. The post_imputation/ directory contains the following sub-folders:

  • A bfiles/ folder including three binary plink imputed files and a .log file coming from the last process of the Post-Imputation workflow

  • A figures/ folder including log plots illustrating the number of samples and variants at each step

  • A logs/ folder including a post_impute_log.txt file summarising details about the numbers of samples, variants, phenotypes and the processes of the workflow

Although the Post-Imputation workflow is nested under the --impute workflow, it is also designed to run independently, as some users may prefer to run imputation on external online servers, or may already have imputed data and wish to proceed only with Post-Imputation QC. The accepted input files for Post-Imputation are a valid VCF file and an accompanying plink .fam file, which should both contain the same samples. To run Post-Imputation QC on the synthetic toy dataset, the user can run the following command:

nextflow run main.nf -profile conda -resume \
  --vcf toy.vcf \
  --fam toy.fam \
  --post_impute \
  --results results_toy/

As already explained above, the results directory contains a new post_imputation/ folder containing the same elements as in the Imputation workflow.

Lastly, snpQT provides a continuously updated command-line help page, accessed using the parameter --help. We provide further information about the implemented snpQT processes, the synthetic toy dataset and the real-world ALS dataset results in the online documentation at https://snpqt.readthedocs.io/en/latest/.

Conclusions

The snpQT tool offers robust QC combined with scalability, reproducibility, flexibility and a user-friendly design, which can appeal to a broad spectrum of users. It is stand-alone software, implemented as a modular nextflow DSL2 workflow. No additional coding is required, nor manual installation or download of any data or other programs, apart from nextflow and Anaconda or Docker. We have designed environments of standard bioinformatics tools for each stage of the workflow, to ensure software version consistency and to improve user experience. The input for snpQT is a VCF file and/or binary plink files, formats which are widely used. The architecture of snpQT provides a thorough QC analysis (inspired and tested by 6,22,23), including modifiable parameters and generated plots for both before and after the applied threshold for the majority of the steps. Outputs include interactive .html reports, summative .log files and graphs summarising all the results for easier inspection. For users who have limited experience with QC analysis, a thorough "how-to" guide and step-by-step tutorials are provided, using the demonstration dataset available with the tool, which can also be informative for users who wish to become newly acquainted with QC analysis.

Software availability

Source code available from: https://github.com/nebfield/snpQT

Archived source code at time of publication:

Zenodo: nebfield/snpQT: v0.1.3 - Fluffy penguin24, https://doi.org/10.5281/zenodo.5083093

License: GNU GPL v3.0

Data availability

Underlying data

Zenodo: snpQT reference data (Version 0.1)25, http://doi.org/10.5281/zenodo.4916469

  • This project contains the processed reference data required by snpQT to function. snpQT can download and process raw data from scratch to create reference data, as described in the installation section, or the data above can be downloaded and used to save time.

Zenodo: nebfield/snpQT: v0.1.3 - Fluffy penguin24, https://doi.org/10.5281/zenodo.5083093

  • This project contains the synthetic dataset used in the Use Case section, which is distributed with the source code.

NCBI dbGaP: Mega-GWAS ALS I. Accession number, phs000101.v5.p1: https://identifiers.org/dbgap:phs000101.v5.p1
