Abstract
Motivation: Genome-wide association studies (GWAS) commonly assume phenotypic and genetic homogeneity that is not present in complex conditions. We designed Transformative Regression Analysis of Combined Effects (TRACE), a GWAS methodology that better accounts for clinical phenotype heterogeneity and identifies gene-by-environment (GxE) interactions. Using UK Biobank (UKB) data, we demonstrated that TRACE increased the variance explained in All-Cause Heart Failure (AHF) by discovering novel single nucleotide polymorphism (SNP) and SNP-by-environment (i.e., GxE) interaction associations. First, we transformed 312 AHF-related ICD-10 codes (including AHF) into continuous low-dimensional features (i.e., latent phenotypes) for a more nuanced disease representation. Then, we ran a standard GWAS on the latent phenotypes to discover main effects and identified GxE interactions with target encoding. Genes near associated SNPs subsequently underwent enrichment analysis to explore the functional mechanisms that may underlie the associations. Finally, each latent phenotype was regressed on its SNP hits, and the fitted latent phenotype values were used to measure the proportion of AHF variance explained.
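The steps described above (codes to latent phenotype, per-SNP scan, variance explained by the hits) can be illustrated with a minimal sketch on simulated data. This is not the TRACE implementation: the first principal component stands in for the latent-phenotype transformation as an assumption, the data, effect sizes, and the threshold of 5.0 are hypothetical, and the GxE target-encoding step is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_codes, n_snps = 2000, 20, 50  # hypothetical sizes

# Simulated genotypes: minor-allele counts (0, 1, 2) per person and SNP.
genotypes = rng.binomial(2, 0.3, size=(n_people, n_snps)).astype(float)

# Simulated shared liability: only SNP 0 has a (hypothetical) causal effect.
liability = 0.8 * genotypes[:, [0]] + rng.normal(size=(n_people, 1))
# Binary diagnosis-code matrix: a code is present when liability plus
# code-specific noise crosses a threshold.
codes = (liability + rng.normal(size=(n_people, n_codes)) > 2.0).astype(float)

# Step 1: compress the binary codes into one continuous latent phenotype.
# Assumption: the first principal component is used as a simple stand-in.
centered = codes - codes.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
latent = centered @ vt[0]

# Step 2: scan every SNP with a simple linear regression on the latent
# phenotype, expressed here through the Pearson correlation and its t-statistic.
def snp_correlations(y, G):
    y_std = (y - y.mean()) / y.std()
    G_std = (G - G.mean(axis=0)) / G.std(axis=0)
    return (G_std * y_std[:, None]).mean(axis=0)

r = snp_correlations(latent, genotypes)
t_stats = r * np.sqrt((n_people - 2) / (1.0 - r ** 2))
hits = np.flatnonzero(np.abs(t_stats) > 5.0)  # hypothetical threshold

# Step 3: regress the latent phenotype on its SNP hits and report the
# proportion of latent-phenotype variance explained by the fitted values.
X = np.column_stack([np.ones(n_people), genotypes[:, hits]])
coef, *_ = np.linalg.lstsq(X, latent, rcond=None)
fitted = X @ coef
r2 = 1.0 - ((latent - fitted) ** 2).sum() / ((latent - latent.mean()) ** 2).sum()

print("SNP hits:", hits)
print("Variance explained by hits:", round(float(r2), 3))
```

In a real analysis the scan would also adjust for covariates such as ancestry components, and the variance explained would be measured against AHF itself rather than against the latent phenotype.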
Results: Our method identified over 100 main GWAS effects that were consistent with prior studies and hundreds of novel gene-by-smoking interactions, which collectively accounted for approximately 10% of AHF variance. This is an improvement over traditional GWAS, whose results account for a negligible proportion of AHF variance. Enrichment analyses suggested that hundreds of miRNAs mediated SNP effects on various AHF-related biological pathways. The TRACE framework can be applied to decode the genetics of other complex diseases.
Availability: All code is available at https://github.com/EpistasisLab/latent_phenotype_project
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This work was funded by NIH grants LM010098, AG066833, and 5T32HG000046.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The North West Multi-centre Research Ethics Committee (MREC) of the UK Biobank gave ethical approval for this work. All UK Biobank participants consented to their data's use for research purposes, and our study was conducted in compliance with the UK Biobank's guidelines. While our study did not require additional ethics approval, we ensured that all practices followed the ethical guidelines provided by the UK Biobank. All data used in our study are anonymized, and no individual-level data are reported in a way that could lead to the identification of participants. Therefore, specific consent to publish from individual participants is not applicable. Please refer to the UK Biobank's website for further details.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data availability
The input data are available for other researchers via the UKB’s controlled access scheme [69]. The procedure to apply for access [70] requires registering with the UK Biobank and compiling an application form detailing:
A summary of the planned research
The UK Biobank data fields required for the project
A description of derivatives (data, variables) generated by the project
In addition, several publicly available bioinformatics tools with associated databases were used in this study:
Genevestigator: We used this to check whether genes containing SNP hits show differential expression (DE) in response to cigarette smoke. It can be accessed at https://genevestigator.com/.
LDtrait tool: We used this to find previously reported GWAS SNP hits in LD with those of our study. It can be accessed at https://ldlink.nih.gov/?tab=ldtrait.
DisGeNET: We used this public platform’s gene-disease association (GDA) scores to quantify the evidence that genes containing some of our SNP hits are related to cardiovascular disease. It is available at https://www.disgenet.org/search.
FUMA: We used this to select all genes within 300 kb of our SNP hits. It is available at https://fuma.ctglab.nl/.
MSigDB: We used this for our enrichment analyses. It can be accessed at https://www.gsea-msigdb.org/gsea/msigdb/index.jsp.
miEAA: We used this for our miRNA enrichment analysis. It is available at https://ccb-compute2.cs.uni-saarland.de/mieaa.
These online resources are publicly accessible, but some uses may require user registration. Please refer to each resource’s respective website for details on access, data use policies, and terms of service.
List of abbreviations
- AHF: All-Cause Heart Failure
- AHD: atherosclerotic heart disease
- CHD: coronary heart disease
- CI: Confidence Interval
- DE: Differential Expression
- GBC: Gradient Boosting Classification
- GO: Gene Ontology
- GWAS: Genome-Wide Association Studies
- GxE: gene-by-environment
- ICC: Intraclass Correlation Coefficient
- KB: kilobases
- LD: Linkage Disequilibrium
- LR: Logistic Regression
- MICE: Multivariate Imputation by Chained Equations
- PCA: Principal Component Analysis
- SNP: single nucleotide polymorphism
- TRACE: Transformative Regression Analysis of Combined Effects
- UKB: United Kingdom Biobank