Genome-wide haplotype association analysis of primary biliary cholangitis risk in Japanese

Im, Cindy; Sapkota, Yadav; Moon, Wonjong; Kawashima, Minae; Nakamura, Minoru; Tokunaga, Katsushi; Yasui, Yutaka

doi:10.1038/s41598-018-26112-1

Download PDF

Article
Open access
Published: 17 May 2018

Genome-wide haplotype association analysis of primary biliary cholangitis risk in Japanese

Cindy Im¹,
Yadav Sapkota²,
Wonjong Moon²,
Minae Kawashima³,
Minoru Nakamura⁴,
Katsushi Tokunaga³ &
…
Yutaka Yasui^1,2

Scientific Reports volume 8, Article number: 7806 (2018) Cite this article

1281 Accesses
5 Citations
Metrics details

Subjects

Abstract

Primary biliary cholangitis (PBC) susceptibility loci have largely been discovered through single SNP association testing. In this study, we report genic haplotype patterns associated with PBC risk genome-wide in two Japanese cohorts. Among the 74 genic PBC risk haplotype candidates we detected with a novel methodological approach in a discovery cohort of 1,937 Japanese, nearly two-thirds were replicated (49 haplotypes, Bonferroni-corrected P < 6.8 × 10⁻⁴) in an independent Japanese cohort (N = 949). Along with corroborating known PBC-associated loci (TNFSF15, HLA-DRA), risk haplotypes may potentially model cis-interactions that regulate gene expression. For example, one replicated haplotype association (9q32–9q33.1, OR = 1.7, P = 3.0 × 10⁻²¹) consists of intergenic SNPs outside of the human leukocyte antigen (HLA) region that overlap regulatory histone mark peaks in liver and blood cells, and are significantly associated with TNFSF8 expression in whole blood. We also replicated a novel haplotype association involving non-HLA SNPs mapped to UMAD1 (7p21.3; OR = 15.2, P = 3.9 × 10⁻⁹) that overlap enhancer peaks in liver and memory T_h cells. Our analysis demonstrates the utility of haplotype association analyses in discovering and characterizing PBC susceptibility loci.

Regional heritability mapping identifies several novel loci (STAT4, ULK4, and KCNH5) for primary biliary cholangitis in the Japanese population

Article Open access 09 April 2021

rs1944919 on chromosome 11q23.1 and its effector genes COLCA1/COLCA2 confer susceptibility to primary biliary cholangitis

Article Open access 25 February 2021

A regulatory variant at 19p13.3 is associated with primary biliary cholangitis risk and ARID3A expression

Article Open access 28 March 2023

Introduction

Primary biliary cholangitis (PBC; also known as primary biliary cirrhosis, MIM 109720) is a progressive autoimmune disease of the liver, leading to the destruction of the bile ducts, and in end-stage cases, liver failure. While the factors that underlie PBC susceptibility and increase risk for disease progression remain enigmatic^1,2, a concordance rate of 63% for PBC has been observed in monozygotic twins, suggesting that PBC has a strong hereditary component³.

Recent genome-wide association studies (GWAS) in European cohorts have confirmed associations between the human leukocyte antigen (HLA) locus and identified over two dozen non-HLA PBC susceptibility loci^4,5,6,7. PBC GWAS in Japanese cohorts have only replicated a minor number of risk loci identified in European populations (e.g., IL7R, IKZF3, CD80), and revealed features of the PBC genetic architecture that may be dissimilar between populations, with TNFSF15, POU2AF1, and PRKCB emerging as major susceptibility loci among Japanese^8,9. While these results suggest that additional single-SNP GWAS with larger sample sizes are warranted, it is also worthwhile to consider complementary analytic approaches to gain further insights. One such approach involves the exploration of the effects of haplotypes, or the arrangement of multiple SNP alleles on the same chromosome. Haplotype patterns may not only be more powerful for mapping disease genes, but may also be uniquely informative about known single SNP associations, since haplotypes are also known to vary considerably between populations, bear signatures of selection, and may contextualize the role genetic variants play, singly or in tandem, in disease pathogenesis^10,11,12.

Among existing methods to conduct genome-wide haplotype association studies with unphased genotype data, the most popular strategy is to split the genome indiscriminately into overlapping “sliding” windows, and simultaneously infer frequencies for all possible haplotypes among a small, fixed number of SNPs within each window and test global associations under a regression-based framework^11,13,14. Due to the computational burden of simultaneously inferring and testing haplotypes, the selected windows for haplotype formation are small, and constructed haplotypes consider a limited number of contiguous SNPs. An important consequence of these analytic restrictions is that a comprehensive exploration of haplotypes as models for the cis-regulation of gene expression is typically not feasible, given that gene expression is modulated by haplotypes comprised of both proximal and distal SNPs with individual or interacting effects^15,16 and may be influenced by larger networks of SNPs (i.e., three SNPs or more)¹⁷.

In the current study, we conducted a genome-wide gene-based haplotype association analysis to uncover relationships between SNPs that are potentially transcriptionally-relevant and are also associated with PBC risk. We applied a regression-based methodology to find haplotypes consisting of three SNPs associated with PBC risk among sets of SNPs mapped to extended gene-centered windows in a discovery cohort of 1,937 Japanese, without restricting formed 3-SNP haplotypes to contiguous SNPs. Phased haplotypes inferred with whole chromosome SNP data were treated as observed in our downstream association analysis. We used a permutation-based approach to select top haplotypes and replicated selected findings in a second independent Japanese cohort (N = 949). Ancillary bioinformatics analyses were conducted to assess the biological plausibility of associations between PBC risk and detected 3-SNP haplotypes.

Results

Discovery and replication of genic haplotypes associated with PBC risk

We developed a novel approach to detect haplotypes associated with PBC risk by leveraging the logic regression¹⁸ methodology, a statistical learning method that employs a stochastic search algorithm to detect combinations of binary predictors (i.e., haplotypes) associated with an outcome of interest (i.e., PBC risk) within a generalized linear model (GLM) framework. After computationally resolving haplotype phase for the 2,886 participants in the combined study cohort, a total of 272,131 SNPs were mapped to at least one of 15,137 gene analytic windows, or broadly-defined gene regions that include RefSeq genes and the flanking 500-kb regions before and after transcription start and stop sites (Supplementary Fig. S1). Using the proposed methodology, we identified the most strongly associated haplotype consisting of three SNPs for every gene analytic window in our discovery cohort (N = 1,937). A total of 7,317 unique candidate 3-SNP haplotype associations were detected: 7,272 genic haplotypes included no SNPs in the HLA region (chr6:29645000–33365000, hg19 build), while 45 genic haplotypes included at least one HLA region SNP.

We used a permutation-based evaluation statistic¹⁹ to assess whether the candidate genic 3-SNP haplotypes had stronger associations with PBC risk than expected. Applying a pre-defined cut-off (top one percentile of permutation-based evaluation statistic values), 74 candidate 3-SNP haplotype associations were selected for replication follow-up. We employed a Bonferroni-corrected per-test significance threshold (P < 0.05/74 = 6.8 × 10⁻⁴) in our replication analysis with our independent cohort (N = 949). Under this p-value threshold, nearly two-thirds of selected 3-SNP haplotypes were replicated (49/74 haplotypes). All genic haplotypes with HLA region SNPs were replicated (37/37 trees), while 32.4% of genic haplotypes without HLA region SNPs (12/37 trees) were replicated. To contextualize the efficacy of our permutation-based evaluation statistic cut-off in capturing the strongest haplotype signals detected with logic regression, Fig. 1 contrasts the distribution of p-values for association tests in the replication cohort (N = 949) for the 74 selected 3-SNP haplotypes to the relatively uniform replication p-value distribution for the 7,243 dropped 3-SNP haplotypes.

Discovery and replication analysis results for the 49 replicated 3-SNP genic haplotypes are provided in Supplementary Table S1. Among replicated haplotypes, 69.4% (34) included at least one SNP that was not nominally significant, while 24.5% (12) contained no SNPs that individually achieved genome-wide significance (P < 5 × 10⁻⁸) (Supplementary Table S2). The magnitude of estimated ORs for replicated 3-SNP haplotypes in the combined cohort (N = 2,886) assuming additive haplotype effects ranged from 1.672 to 15.246 (inverting protective associations), with p-values ranging from 1.3 × 10⁻³⁵ to 3.9 × 10⁻⁹. Tables 1 and 2 highlight selected example results for five replicated 3-SNP haplotypes. Among these examples, (rs9295704 = C) and ((rs2451752 = A) and (rs2575174 = C)) on chromosome 6 (OR = 0.365, P = 8.9 × 10⁻¹⁵), and ((rs12671658 = T) or (rs12702656 = A)) and (rs11768586 = G) on chromosome 7 (OR = 0.066, P = 3.9 × 10⁻⁹) represent novel non-HLA loci associated with PBC risk (Table 2). Other examples provided correspond to genomic regions carrying single variants with the strongest associations with PBC risk in previous PBC GWAS in Japanese (near HLA-DRA and TNFSF15)^8,9.

Table 1 Selected examples of replicated 3-SNP haplotypes.

Full size table

Table 2 Single SNP and component 2-SNP haplotype effects for example replicated 3-SNP haplotypes.

Full size table

Missingness rates for replicated 3-SNP haplotypes in the unphased data among all available controls (median: 1.13%, IQR: 0.86%; N = 1,505) and cases (median: 1.09%, IQR: 0.80%; N = 1,381) were low and comparable between groups (Supplementary Table S3). Pre-phasing counts of unambiguous homozygous carriers and potential carriers of at least one haplotype copy were consistent with the estimated corresponding replicated haplotype counts in the phased data for all available controls (N = 1,505, Supplementary Table S4). Lastly, frequency distributions for each of the replicated 3-SNP haplotypes among all phased study controls (N = 1,505) and the phased 1000G JPT reference panel (N = 104) demonstrated that these distributions were comparable for each haplotype (Supplementary Table S4).

Comparing proposed and benchmark methodologies for haplotype detection

We applied a benchmark haplotype association methodology to conduct global tests of association¹⁴ for estimated haplotypes formed within all available sliding window sets comprised of three contiguous SNPs in each of the 15,137 gene analytic windows in our discovery cohort (N = 1,937). Using this benchmark method, we identified 1,425 haplotypes with p-values meeting a Bonferroni-corrected p-value threshold (P < 0.05/1,567,361 tests = 3.2 × 10⁻⁸) across 205 gene analytic windows (Table 3). Nearly two-thirds of the gene windows (135/205) with a top haplotype association detected by the benchmark method was a window that also contained a replicated 3-SNP haplotype detected with our proposed method, suggesting the proposed and benchmark methods found many of the same gene analytic windows to be important for haplotype exploration. No exact matches for 3-SNP haplotypes detected by logic regression were observed among top haplotypes identified by the benchmark method.

Table 3 Comparison of logic regression and benchmark methods to detect 3-SNP haplotypes in the discovery cohort, N = 1,937.

Full size table

Given that SNPs contributing to 3-SNP haplotypes detected by logic regression are not necessarily contiguous and estimated haplotypes were treated as observed in this analysis, we computed global test score statistics with variance estimates that account for the uncertainty in the haplotype estimation¹⁴ for estimated haplotypes consisting of the three SNPs in each of the replicated 3-SNP haplotype trees in the combined cohort (N = 2,886). The scale of each haplotype tree’s global test p-value appeared to be consistent with the haplotype p-value obtained with the proposed method (Supplementary Table S5).

Functional annotation of replicated haplotype SNPs

We conducted enrichment analyses to broadly investigate the biological plausibility of replicated 3-SNP haplotypes’ associations with PBC risk by comparing the set of 106 unique SNPs contributing to replicated 3-SNP haplotypes (“haplotype SNPs”) against an unpruned comparison set of 16,036 SNPs mapped to gene analytic windows with nominal univariate associations with PBC risk (P < 0.05). An examination of the number of gene expressions (eQTLs) significantly associated with haplotype SNPs in three blood and liver cell/tissue types (Genotype-Tissue Expression, GTEx v7)²⁰ demonstrated that haplotype SNPs were significantly enriched for eQTLs in lymphoblastoid cells relative to the comparison SNP set (P = 2.7 × 10⁻³) (Supplementary Table S6). The set of haplotype SNPs was also significantly enriched for overlaps with ChIP-seq histone modification peaks linked with enhancer (H3K4me1) or promoter (H3K4me3) activity in at least 20 of the 29 consolidated blood and liver cell types available in Roadmap Epigenomics Mapping Consortium²¹ (REMC) data (Bonferroni-corrected P < 0.05/29 = 1.7 × 10⁻³; Supplementary Tables S7, S8), respectively. Haplotype SNPs were also enriched for an indicator of open chromatin (DNase I peaks) in four blood cell types among 11 blood/liver cell types with REMC assay data (Bonferroni-corrected P < 0.05/11 = 4.5 × 10⁻³; Supplementary Table S9). Haplotype SNPs were jointly enriched for enhancer and promoter peaks in 15 blood/liver cell types, and simultaneously enriched for all three chromatin state indicators in three cell types in peripheral blood: primary B cells, primary T cells, and monocytes (Fig. 2). These results are consistent with reported immune-related PBC disease mechanisms that implicate B cell and T cell differentiation pathways²², and observations of significant inflammatory cell infiltration (including B cells, T cells, and macrophages) associated with the loss of biliary epithelial cells in the portal tract²³.

Figure 3 highlights two replicated 3-SNP haplotypes in the HLA region near rs3129887, the variant with the strongest single-SNP association with PBC risk in Japanese^8,9. Association testing results for haplotypes detected with logic regression and the benchmark method are shown in the top data track in Fig. 3 across chr6:32156782–32585905, followed by a visual summary of relevant functional annotations for SNPs in the replicated 3-SNP haplotypes. One of the replicated 3-SNP haplotypes shown is (rs3129881 = C) or ((rs375244 = A) and (rs3132947 = G)) (OR = 3.665, P = 2.3 × 10⁻²⁴), with SNPs in NOTCH4 and HLA-DRA introns; the other is ((rs9268831 = T) or (rs9269190 = T)) or (rs9270652 = C) (OR = 3.075, P = 7.3 × 10⁻²⁹), comprised of intergenic SNPs near HLA-DRA and HLA-DRB1. These haplotypes connect SNPs that may be linearly far apart, can consist of SNPs that do not achieve genome-wide significance (P < 5 × 10⁻⁸), and combine SNPs across multiple top 3-SNP haplotype windows tagged by the benchmark method. Additionally, all six haplotype SNPs overlap H3K4me1 and/or H3K4me3 peaks in primary B or T cells, while five SNPs have at least one significant eQTL in whole blood, lymphoblastoid, or liver cells.

Replicated 3-SNP haplotypes in chromosomes 7 and 9 include SNPs that overlap genomic regions tagged with the smallest p-values identified by the benchmark method across chr7:7667281–8026742 and chr9:116727079–118438852, respectively, while corresponding functional annotations contextualize the relative contributions of each SNP (Supplementary Figs S2 and S3). To clarify, two intergenic SNPs in the chromosome 9 (rs4979484 = C) or ((rs13300483 = T) and (rs7028891 = G)) haplotype with strong univariate associations with PBC risk (P < 5 × 10⁻⁸) also have significant associations with TNFSF8 expression in whole blood. However, this haplotype also includes rs4979484, an intergenic SNP with a relatively weak marginal association with PBC risk (P = 0.005) (Table 2). Interestingly, rs4979484 not only overlaps H3K4me1 and H3K4me3 peaks in multiple blood/liver cell types, but is also in a region with evidence of binding to nuclear factor kappa B (NF-kB), a transcription factor reported to play a critical role in inflammation and immunity processes²⁴, in a lymphoblastoid cell line of Japanese origin (GM18951; Table 4, Supplementary Table S10).

Table 4 Functional annotations of SNPs in selected replicated 3-SNP haplotypes.

Full size table

Discussion

In the current study, we propose a novel approach for the detection of haplotype associations to detect PBC risk haplotypes genome-wide. Previous studies have successfully combined agnostic sliding windows with an exhaustive testing strategy to identify haplotypes associated with disease risk. To contrast, our proposed method uses a logic regression-based stochastic search to detect best 3-SNP haplotype associations among phased SNP alleles mapped to broadly-defined gene regions. This approach has several strengths. First, the chosen analytic windows encourage searches for genic haplotypes, thereby finding combinations of variably-spaced SNPs that may influence cis-regulatory mechanisms for gene expression. Second, logic regression considers multiple models of risk (e.g., presence of risk alleles at either of two loci) that can better reflect regulatory redundancies that may exist in controlling transcription. Lastly, the proposed method avoids exhaustive testing, potentially capturing true haplotype associations that would otherwise be missed due to lack of statistical power. Using the proposed method, a total of 74 3-SNP haplotypes on chromosomes 6, 7, and 9 were considered as having stronger associations with PBC risk than expected under a permutation-based approach in a discovery cohort of 1,937 Japanese individuals. Nearly two-thirds of these selected haplotypes (49 haplotypes) were replicated in a second independent Japanese cohort (N = 949) after applying a Bonferroni-corrected p-value threshold (P < 6.8 × 10⁻⁴).

Haplotype association analyses using inferred haplotypes in downstream analyses may be vulnerable to Type I error inflation and biased estimates of genetic effects due to misclassification of haplotype states²⁵. Several aspects of this analysis mitigate these concerns. Phasing was conducted for cases and controls simultaneously, which provides greater control of Type I error than phasing these groups separately²⁵. Differential misclassification of haplotype states is unlikely, since haplotype phasing was conducted without knowledge of disease status. Non-differential misclassification is more plausible; thus, reported effect estimates may be biased towards the null (haplotype has no effect on PBC risk). To safeguard against false positives, we conducted a replication study and applied a Bonferroni-corrected p-value threshold to address multiple testing in replication. Our benchmark method analysis that considered the uncertainty of haplotype phase also tagged ~82% of the gene analytic windows containing replicated 3-SNP haplotypes. Lastly, the frequency distributions for replicated 3-SNP haplotypes among controls in our study were consistent with the 1000G JPT reference panel. These results suggest that our method can reliably detect credible 3-SNP genic haplotypes associated with PBC risk.

Limitations of our proposed method include not finding the haplotype with the strongest association (due to non-exhaustive testing) and missing risk haplotypes outside of genic regions. However, we replicated 49 out of 74 genic haplotype associations selected in discovery; replicated 3-SNP haplotypes also frequently included SNPs that overlapped top haplotype associations identified with the exhaustive testing-based benchmark method. Yet, exact matches between top haplotypes detected with the proposed and benchmark methods were not observed. Instead, replicated 3-SNP haplotypes detected by logic regression linked SNPs ~335 kb apart on average, with many contributing SNPs overlapping functional annotations in cell/tissue types relevant to PBC. Thus, the haplotypes detected with the proposed method: (a) frequently include SNPs that overlap genomic regions with top haplotype associations identified by an exhaustive testing-based benchmark method; (b) are unlikely to be detected with existing association methods; and (c) combine the effects of variably-spaced and potentially functional SNPs mapped to regions of the genome that are more likely to be transcribed.

Similar to recent PBC GWAS^4,5,8, we did not examine sex chromosome variants. However, sex chromosome-related defects and haplotype deficiencies likely play an important role in PBC etiology^26,27. Specifically, higher rates of X monosomy in peripheral blood cells in PBC-affected women have been observed, suggesting X monosomy influences PBC pathogenesis²⁸. Although skewed X-chromosome inactivation (XCI) may contribute to autoimmune disease risk^29,30, preferential X loss that involves particular X-linked haplotypes may better explain the increased X monosomy in women with PBC³¹. Analogous to this X haploinsufficiency observed in women with PBC, increased Y chromosome loss has been associated with PBC in men³². Further exploration of sex chromosome-specific haplotype associations with PBC risk is needed.

While we did not functionally validate replicated 3-SNP haplotypes, results from ancillary bioinformatics analyses suggest that identified haplotypes may be considered in future functional investigations of genetic susceptibility factors for PBC. SNPs in replicated 3-SNP haplotypes were significantly enriched for gene expressions in lymphoblastoid cells and indicators of enhancer, promoter, and open chromatin states in blood and liver cell types compared to SNPs in extended genic regions that were marginally associated with PBC risk. Specific functional annotations of SNPs in PBC-associated haplotypes also indicate that identified haplotypes may enhance understanding of posited disease mechanisms for both novel and known genetic associations. For example, the chromosome 7 haplotype, ((rs12671658 = T) or (rs12702656 = A)) and (rs11768586 = G), represents a novel candidate PBC susceptibility locus, with SNPs mapped to intronic regions of UMAD1. Two SNPs in this haplotype (rs12702656, rs11768586) overlap enhancer peaks in liver cells. A suggestive association with insulin-like growth factor 1 (IGF1) levels, a hormone predominantly produced by the liver, has been reported for a proximal variant mapped to UMAD1 (rs7780564, p = 3.9 × 10⁻⁷)³³; IGF1 is hypothesized to regulate biliary epithelial cell proliferation³⁴. On the other hand, TNFSF15 is reported to be the most strongly associated non-HLA genetic susceptibility factor for PBC in Japanese. TNFSF15 is anticipated to play a role in the inflammation response, activating Th1/Th17 cell differentiation and cytokine production⁹. The (rs4979484 = C) or ((rs13300483 = T) and (rs7028891 = G)) haplotype, comprised of intergenic SNPs near TNFSF15, implicate TNFSF8 as another possible contributor to PBC risk. This 3-SNP haplotype includes two SNPs significantly associated with TNFSF8 expression and a third SNP (rs4979484) with evidence of NF-kB binding in GM18951 (JPT) lymphoblastoid cells. TNFSF8 specifically encodes a cytokine (CD30L) expressed in T cells and monocytes; like TNFSF15, TNFSF8 is also a known activator of NF-kB, a transcription factor implicated in inflammation and immunity pathways²⁴.

In conclusion, this study presents a novel approach to conduct haplotype association analyses genome-wide, with a deliberate focus on interrogating regions of the genome that are likely to be transcribed. With this method, we identified novel candidate PBC susceptibility loci (e.g., UMAD1) and detected haplotype patterns that potentially contribute to hypothesized disease pathways that include previously reported PBC susceptibility genes (HLA-DRA, TNFSF15) in the Japanese population. Broader explorations of haplotypes may increase understanding of the genetic basis of PBC and potentially inform new interventions that improve disease prognosis.

Methods

Study population

The current study combines data collected for two previous PBC GWAS in Japanese, both coordinated by the Japan PBC-GWAS Consortium^8,9. All research participants provided informed consent. The methods/protocols implemented in this analysis were approved by the ethics committees of the Nagasaki Medical Center and all participating institutions. All methods were performed in accordance with relevant guidelines and regulations.

The combined study population has been previously described in detail⁸. Briefly, DNA samples were obtained from either healthy controls reporting no apparent disease or PBC cases, defined by laboratory or histological evidence of at least two of the following criteria: cholestasis, ascertained by elevated alkaline phosphatase; serum anti-mitochondrial antibodies; and non-suppurative destructive cholangitis and interlobular bile duct destruction. DNA samples were genotyped using the Affymetrix Axiom Genome-Wide ASI 1 Array (Affymetrix, Santa Clara, CA). As previously reported⁸, samples with excess heterozygosity rates, cryptic relatedness, or of non-Japanese ancestry were removed. A total of 425,290 autosomal SNPs were retained under the following quality control criteria: SNP call rate ≥95%; MAF ≥5%; and Hardy-Weinberg equilibrium (HWE) P ≥ 0.001 in controls. Samples with <97% sample call rate among retained SNPs were excluded, resulting in a study sample of 2,886 individuals. We split this sample a priori into two cohorts to correspond with the timing of sample collections for the two previous PBC GWAS: 1,937 participants (901 cases, 1,036 controls) for haplotype signal discovery, and 949 participants (480 cases, 469 controls) for haplotype signal replication.

Haplotype phasing

Haplotype phase was computationally estimated using SHAPEIT v2.79³⁵ for whole chromosomes with unphased SNP genotypes for all 2,886 unrelated samples in the combined study cohort. To improve phasing accuracy, we used 1000 Genomes Phase 3 genetic map recombination rates, increased the number of conditioning states for the haplotype estimation to 600 states, and increased the numbers of burn-in (10), pruning (10), and main iterations (50) of the SHAPEIT MCMC algorithm. Recommended parameters for haplotype estimation mean window size (2 Mb) and effective population size (15,000) for GWAS data were employed. Upon phasing, each sample was assigned its most likely haplotype phase configuration with binary-encoded phased genotypes at each SNP locus (0|0, 0|1, 1|0, or 1|1, where 0 = reference allele, 1 = alternative allele).

Statistical Analyses

Restricting haplotype formation within gene regions

We restricted our search for haplotype signals within regions centered on annotated protein-coding and non-translated RNA-encoding genes. Specifically, we used gene transcripts annotated by the RefSeq gene model (release 74, GRCh37/hg19 build)³⁶, employed ANNOVAR³⁷ to map SNPs in our dataset to introns, exons, and 3′/5′ untranslated regions, and identified flanking 500-kb regions for each RefSeq transcript, as flanking gene regions are critical for transcriptional events³⁸. SNPs mapped to a gene and flanking regions were considered as a single “gene analytic window” (Supplementary Fig. S1). To reduce each set of SNPs in a given gene analytic window to “tagging SNPs” in formulating haplotypes, we removed SNPs within each window sequentially so that no pair of SNPs was in high linkage disequilibrium (r² ≤ 0.8). Gene analytic windows with a minimum of two SNPs were retained for analysis.

Detecting 3-SNP haplotypes with logic regression

We identified gene-based haplotype patterns consisting of three SNPs associated with PBC risk with an adapted logic regression¹⁸ algorithm. Briefly, logic regression is a statistical learning method that utilizes a stochastic search algorithm to support the detection of higher order interactions associated with an outcome within a generalized linear model (GLM) framework. For our analysis, 3-SNP haplotypes are expressed mathematically as Boolean (true/false) variables that combine SNP alleles on the same chromosome, e.g., ((SNP1 = reference allele) and (SNP2 = alternative allele) and (SNP3 = alternative allele)) = {True, False}.

We utilized logic regression’s “simulated annealing” search algorithm (R “LogicReg” package, version 1.5.8) to build many possible haplotype predictors stochastically and evaluate them based on GLM model scores (e.g., deviance scores for logistic regression). Logic regression was applied to each gene analytic window, considering the contributions of two haplotypes per subject in our discovery cohort (N = 1,937), to identify candidate 3-SNP haplotype patterns associated with PBC risk. To stabilize the performance of the stochastic algorithm, we utilized 100 different initialization values for each gene analytic window and chose the model with the lowest deviance score among the 100 fits.

To select candidate 3-SNP haplotypes for replication, we used a permutation-based evaluation statistic based on previous work¹⁹. Specifically, for each gene analytic window, 20 permutations of PBC disease status were used to obtain model deviance scores under each permutation. We then calculated the 20 possible null distribution deviations between a deviance score under a given permutation and the median estimated with the remaining 19 permuted datasets, along with the corresponding median absolute deviation (MAD, a robust measure of variability) for each gene window’s set of 20 null distribution deviations. We defined our permutation-based evaluation statistic as $\frac{{D}_{obs}-{D}_{med}}{MA{D}_{{D}_{med}}}$, where D_obs is the observed deviance score for the gene analytic window’s best-fitting 3-SNP haplotype, D_med is its respective empirically-derived median under 19 permutations, and $MA{D}_{{D}_{med}}$ is the MAD of the null distribution deviations under 20 permutations.

Given gene analytic windows potentially overlap, the same best-fitting 3-SNP haplotype could be identified for multiple windows. In these cases, the 3-SNP haplotype with the best permutation-based evaluation statistic across multiple windows was retained to contribute to a set of unique 3-SNP haplotypes for replication follow-up (e.g., no haplotype tree had an exact SNP and logic tree structure match with another tree). We set an a priori threshold to select candidate 3-SNP haplotypes: 3-SNP haplotypes with the top 1% of permutation-based evaluation statistics among the set of unique 3-SNP haplotypes were selected for replication, translating to statistic values of less than −11.4. This cut-off translates to the selection of 3-SNP haplotypes with logistic regression model deviance scores at least 11.4 median absolute deviations away from corresponding medians estimated under the null distribution.

Since haplotype logic regression models consider two observations with the same case/control status from each subject (i.e., two haplotypes from two homologous chromosomes), we used the following logistic regression model for the best detected 3-SNP haplotype associations, treating each subject as an independent observation to satisfy GLM assumptions for valid statistical inference:

$$\mathrm{log}(\frac{p}{1-p})={\beta }_{0}+{\beta }_{1}L$$

(1)

where p is PBC risk and L represents additive haplotype effects of the best-fitting 3-SNP haplotype logic tree. The reported model deviance score, odds ratio, and p-value for each detected haplotype in the discovery cohort were derived from equation 1.

To replicate selected haplotype associations, we tested the selected 3-SNP haplotypes from the discovery cohort with the same model described in equation 1 in our independent replication cohort (N = 949) and applied a Bonferroni-corrected per-test significance threshold (P < 0.05/number of selected haplotypes). We further required odds ratios in the replication cohort to have the same direction as the discovery cohort to consider a 3-SNP haplotype as replicated.

Detecting 3-SNP haplotypes with a benchmark method

We used the R “haplo.stats” package¹⁴, commonly employed in secondary explorations of haplotypes in single-SNP GWAS^39,40, as our benchmark haplotype association analysis method. This method is used to obtain maximum likelihood estimates for haplotype frequencies using the expectation-maximization (EM) algorithm. Associations between haplotypes and disease risk are tested with a score statistic for a global test of haplotypes (h-1 degrees of freedom for h haplotypes, treating the most common haplotype as a reference), with variance estimates that account for the uncertainty in the haplotype estimation. We used this method to exhaustively test 3-SNP haplotypes consisting of contiguous SNPs (with estimated haplotype counts of at least 20), using a 3-SNP sliding window strategy with a skip length of one SNP, among SNPs in gene analytic windows.

Ancillary bioinformatics analyses

We performed functional annotations and enrichment analyses with HaploReg v4⁴¹, Roadmap Epigenomics Mapping Consortium (REMC)²¹, and Genotype-Tissue Expression (GTEx Analysis v7)²⁰ resources for SNPs in replicated 3-SNP haplotypes. We conducted enrichment analyses using REMC histone modification ChIP-seq peak (gappedPeak algorithm) data for H3K4me3 (promoter) and H3K4me1 (enhancer) marks, and DNase I hypersensitivity peak (narrowPeak algorithm) data for all available consolidated blood/liver human cell types (29 and 11 cell types available, respectively). For each cell type, we compared the set of SNPs in replicated haplotypes with a set of 16,036 non-overlapping SNPs mapped to gene analytic windows, each with nominal single-SNP associations (P < 0.05) with PBC risk. Frequencies of overlap between SNPs in each set and epigenomic peaks were counted in each cell type. We evaluated enrichment for each assay using Bonferroni-corrected p-values obtained from 2-sided Fisher’s exact tests. To investigate enrichments in gene expressions for SNPs in replicated haplotypes among the three blood- and liver-related tissue types available in GTEx, counts of significant cis-eQTLs (SNPs within +/−1 Mb of transcription start sites, q-value < 0.05) for haplotype SNPs were compared to the aforementioned comparison SNP set using a 2-sided Fisher’s exact test. Visualizations highlighting functional annotations of selected 3-SNP haplotypes were created with the “Gviz” R/Bioconductor package⁴².

Data availability statement

The genetic data analyzed in this study are available in the European Genome-phenome Archive under study accession number EGAS00001002915 (https://www.ebi.ac.uk/ega/studies/EGAS00001002915). All other analyzed data are included in this published article or the Supplementary Information file.

References

Carey, E. J., Ali, A. H. & Lindor, K. D. Primary biliary cirrhosis. Lancet. 386, 1565–75 (2015).
Article CAS PubMed Google Scholar
Kaplan, M. M. & Gershwin, M. E. Primary biliary cirrhosis. N. Engl. J. Med. 353, 1261–73 (2005).
Article CAS PubMed Google Scholar
Selmi, C. et al. Primary biliary cirrhosis in monozygotic and dizygotic twins: genetics, epigenetics, and environment. Gastroenterology. 127, 485–92 (2004).
Article PubMed Google Scholar
Cordell, H. J. et al. International genome-wide meta-analysis identifies new primary biliary cirrhosis risk loci and targetable pathogenic pathways. Nat. Commun. 6, 8019 (2015).
Article CAS PubMed PubMed Central Google Scholar
Mells, G. F. et al. Genome-wide association study identifies 12 new susceptibility loci for primary biliary cirrhosis. Nat. Genet. 43, 329–32 (2011).
Article CAS PubMed PubMed Central Google Scholar
Liu, X. et al. Genome-wide meta-analyses identify three loci associated with primary biliary cirrhosis. Nat. Genet. 42, 658–60 (2010).
Article CAS PubMed PubMed Central Google Scholar
Hirschfield, G. M. et al. Primary biliary cirrhosis associated with HLA, IL12A, and IL12RB2 variants. N. Engl. J. Med. 360, 2544–55 (2009).
Article CAS PubMed PubMed Central Google Scholar
Kawashima, M. et al. Genome-wide association studies identify PRKCB as a novel genetic susceptibility locus for primary biliary cholangitis in the Japanese population. Hum. Mol. Genet. 26, 650–9 (2017).
CAS PubMed Google Scholar
Nakamura, M. et al. Genome-wide association study identifies TNFSF15 and POU2AF1 as susceptibility loci for primary biliary cirrhosis in the Japanese population. Am. J. Hum. Genet. 91, 721–8 (2012).
Article CAS PubMed PubMed Central Google Scholar
Browning, S. R. & Browning, B. L. Haplotype phasing: existing methods and new developments. Nature Reviews Genetics. 12, 703–14 (2011).
Article CAS PubMed PubMed Central Google Scholar
Liu, N., Zhang, K. & Zhao, H. Haplotype-association analysis. Adv. Genet. 60, 335–405 (2008).
PubMed Google Scholar
Sabeti, P. C. et al. Detecting recent positive selection in the human genome from haplotype structure. Nature. 419, 832–7 (2002).
Article ADS CAS PubMed Google Scholar
Tregouet, D. A. et al. Genome-wide haplotype association study identifies the SLC22A3-LPAL2-LPA gene cluster as a risk locus for coronary artery disease. Nat. Genet. 41, 283–5 (2009).
Article CAS PubMed Google Scholar
Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M. & Poland, G. A. Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am. J. Hum. Genet. 70, 425–34 (2002).
Article PubMed Google Scholar
Tao, H., Cox, D. R. & Frazer, K. A. Allele-specific KRT1 expression is a complex trait. PLoS Genet. 2, e93 (2006).
Article PubMed PubMed Central Google Scholar
Rockman, M. V. & Kruglyak, L. Genetics of global gene expression. Nature Reviews Genetics. 7, 862–72 (2006).
Article CAS PubMed Google Scholar
Hemani, G. et al. Detection and replication of epistasis influencing transcription in humans. Nature. 508, 249–53 (2014).
Article ADS CAS PubMed PubMed Central Google Scholar
Ruczinski, I., Kooperberg, C. & LeBlanc, M. Logic regression. J. Comput. Graph. Stat. 12, 475–511 (2003).
Article MathSciNet MATH Google Scholar
Im, C. et al. Genome-wide search for higher order epistasis as modifiers of treatment effects on bone mineral density in childhood cancer survivors. Eur. J. Hum. Genet. 26, 275–86 (2018).
Article CAS PubMed Google Scholar
GTEx Consortium. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 348, 648–60 (2015).
Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature. 518, 317–30 (2015).
Article CAS PubMed PubMed Central Google Scholar
Nakamura M. Genetic Factors in the Pathogenesis of Primary Biliary Cirrhosis in Autoimmune Liver Diseases 157–69 (Springer, 2014).
Jones, D. E. Pathogenesis of primary biliary cirrhosis. Postgrad. Med. J. 84, 23–33 (2008).
Article CAS PubMed Google Scholar
Gaur, U. & Aggarwal, B. B. Regulation of proliferation, survival and apoptosis by members of the TNF superfamily. Biochem. Pharmacol. 66, 1403–8 (2003).
Article CAS PubMed Google Scholar
Lin, D. Y. & Huang, B. E. The use of inferred haplotypes in downstream analyses. Am. J. Hum. Genet. 80, 577–9 (2007).
Article CAS PubMed PubMed Central Google Scholar
Invernizzi, P., Selmi, C. & Gershwin, M. E. Update on primary biliary cirrhosis. Dig Liver Dis. 42, 401–8 (2010).
Article CAS PubMed PubMed Central Google Scholar
Webb, G., Siminovitch, K. & Hirschfield, G. The immunogenetics of primary biliary cirrhosis: a comprehensive review. J Autoimmun. 64, 42–52 (2015).
Article CAS PubMed PubMed Central Google Scholar
Invernizzi, P. et al. Frequency of monosomy X in women with primary biliary cirrhosis. Lancet. 363, 533–5 (2004).
Article PubMed Google Scholar
Özbalkan, Z. et al. Skewed X chromosome inactivation in blood cells of women with scleroderma. Arthritis & Rheumatology. 52, 1564–70 (2005).
Article Google Scholar
Brix, T. H. et al. High frequency of skewed X-chromosome inactivation in females with autoimmune thyroid disease: a possible explanation for the female predisposition to thyroid autoimmunity. The Journal of Clinical Endocrinology & Metabolism. 90, 5949–53 (2005).
Article CAS Google Scholar
Miozzo, M. et al. Preferential X chromosome loss but random inactivation characterize primary biliary cirrhosis. Hepatology. 46, 456–62 (2007).
Article CAS PubMed Google Scholar
Lleo, A. et al. Y chromosome loss in male patients with primary biliary cirrhosis. J Autoimmun. 41, 87–91 (2013).
Article PubMed Google Scholar
Kaplan, R. C. et al. A genome-wide association study identifies novel loci associated with circulating IGF-I and IGFBP-3. Hum. Mol. Genet. 20, 1241–51 (2011).
Article CAS PubMed PubMed Central Google Scholar
Alvaro, D. et al. The intrahepatic biliary epithelium is a target of the growth hormone/insulin-like growth factor 1 axis. J. Hepatol. 43, 875–83 (2005).
Article CAS PubMed Google Scholar
Delaneau, O., Zagury, J. F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods. 10, 5–6 (2013).
Article CAS PubMed Google Scholar
O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–45 (2016).
Article PubMed Google Scholar
Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
Article PubMed PubMed Central Google Scholar
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 489, 57–74 (2012).
Article ADS Google Scholar
Turnbull, C. et al. Genome-wide association study identifies five new breast cancer susceptibility loci. Nat. Genet. 42, 504–7 (2010).
Article CAS PubMed PubMed Central Google Scholar
Cargill, M. et al. A large-scale genetic association study confirms IL12B and leads to the identification of IL23R as psoriasis-risk genes. Am. J. Hum. Genet. 80, 273–90 (2007).
Article CAS PubMed Google Scholar
Ward, L. D. & Kellis, M. HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucleic Acids Res. 40, D930–4 (2012).
Article CAS PubMed Google Scholar
Hahne, F. & Ivanek, R. Visualizing Genomic Data Using Gviz and Bioconductor. Methods Mol. Biol. 1418, 335–51 (2016).
Article PubMed Google Scholar

Download references

Acknowledgements

This study was supported by funding provided by the Alberta Machine Intelligence Institute (Cindy Im and Yutaka Yasui), Grants-in-Aid for Scientific Research from the Japan Society for the Promotion of Science (Minoru Nakamura and Katsushi Tokunaga; #26293181, 17H04169, 15H05271), Grant-in-Aid for Clinical Research from the National Hospital Organization (Minoru Nakamura), Health Labor Science Research Grants from Research on Measures for Intractable Diseases, Intractable Hepato-Biliary Diseases Study Group in Japan (Minoru Nakamura), and the Japan Agency for Medical Research and Development (Katsushi Tokunaga; Platform Program for Promotion of Genome Medicine, 16km0405205h0101). The authors also thank Drs. Sambasivarao Damaraju and Irina Dinu for their contributions to this manuscript.

Author information

Authors and Affiliations

School of Public Health, University of Alberta, Edmonton, Alberta, T6G 1C9, Canada
Cindy Im & Yutaka Yasui
Department of Epidemiology and Cancer Control, St. Jude Children’s Research Hospital, Memphis, TN, 38105, USA
Yadav Sapkota, Wonjong Moon & Yutaka Yasui
Department of Human Genetics, Graduate School of Medicine, The University of Tokyo, Tokyo, 113-0033, Japan
Minae Kawashima & Katsushi Tokunaga
Department of Hepatology, Nagasaki University Graduate School of Biomedical Sciences and Clinical Research Center, National Hospital Organization Nagasaki Medical Center, Omura, Nagasaki, 856-8562, Japan
Minoru Nakamura

Authors

Cindy Im
View author publications
You can also search for this author in PubMed Google Scholar
Yadav Sapkota
View author publications
You can also search for this author in PubMed Google Scholar
Wonjong Moon
View author publications
You can also search for this author in PubMed Google Scholar
Minae Kawashima
View author publications
You can also search for this author in PubMed Google Scholar
Minoru Nakamura
View author publications
You can also search for this author in PubMed Google Scholar
Katsushi Tokunaga
View author publications
You can also search for this author in PubMed Google Scholar
Yutaka Yasui
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

M.N., K.T. and M.K. conceived and coordinated the Japan PBC-GWAS studies. C.I., Y.S. and W.M. processed data. C.I. and Y.Y. developed the analytic methodologies, performed the analysis, and composed the manuscript. All authors provided critical revisions and approved of the manuscript.

Corresponding authors

Correspondence to Cindy Im or Yutaka Yasui.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Im, C., Sapkota, Y., Moon, W. et al. Genome-wide haplotype association analysis of primary biliary cholangitis risk in Japanese. Sci Rep 8, 7806 (2018). https://doi.org/10.1038/s41598-018-26112-1

Download citation

Received: 16 January 2018
Accepted: 30 April 2018
Published: 17 May 2018
DOI: https://doi.org/10.1038/s41598-018-26112-1

This article is cited by

Several genetic variants associated with systemic sclerosis in a Chinese Han population
- Chenxi Liu
- Songxin Yan
- Yongzhe Li
Clinical Rheumatology (2023)
Regional heritability mapping identifies several novel loci (STAT4, ULK4, and KCNH5) for primary biliary cholangitis in the Japanese population
- Olivier Gervais
- Kazuko Ueno
- Masao Nagasaki
European Journal of Human Genetics (2021)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Regional heritability mapping identifies several novel loci (STAT4, ULK4, and KCNH5) for primary biliary cholangitis in the Japanese population

rs1944919 on chromosome 11q23.1 and its effector genes COLCA1/COLCA2 confer susceptibility to primary biliary cholangitis

A regulatory variant at 19p13.3 is associated with primary biliary cholangitis risk and ARID3A expression

Introduction

Results

Discovery and replication of genic haplotypes associated with PBC risk

Comparing proposed and benchmark methodologies for haplotype detection

Functional annotation of replicated haplotype SNPs

Discussion

Methods

Study population

Haplotype phasing

Statistical Analyses

Restricting haplotype formation within gene regions

Detecting 3-SNP haplotypes with logic regression

Detecting 3-SNP haplotypes with a benchmark method

Ancillary bioinformatics analyses

Data availability statement

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing Interests

Additional information

Electronic supplementary material

Supplementary Information

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Several genetic variants associated with systemic sclerosis in a Chinese Han population

Regional heritability mapping identifies several novel loci (STAT4, ULK4, and KCNH5) for primary biliary cholangitis in the Japanese population

Comments

Search

Quick links