Introduction

Analysis of the numerous mutations present in cancer genomes is expected to substantially contribute to our understanding of the causes of malignancy and eventually to the development of personalized treatment plans. DNA sequence contexts of mutations in tumors can provide insights into the mechanisms of mutagenesis in cancer1,2,3. The ‘mutational signature’ approach was introduced in the 1990s4,5,6 and has been successfully applied to delineate the roles of AID and DNA polymerase η in somatic hypermutation in humoral immunity5,7,8, of editing APOBEC3 cytosine deaminases in hypermutagenesis in retroviruses9, and of the formation of dimers versus 6–4 photoproducts in UV mutagenesis10. Recently, this methodology has become popular in the analysis of cancer genomes3,11. As predicted after the discovery of DNA editing by AID and APOBEC cytosine deaminases12, mutations in DNA sequence contexts similar to mutations induced by deaminases in model systems have been found in several types of cancer2,3,13,14. Studies of mutations induced by deaminases are facilitated by their unique properties, namely, the ability to produce, in vitro, clustered mutations in ssDNA at specific contexts surrounding cytosines8,15,16, and the retention of the signatures of deaminase-induced mutagenesis and of the propensity for clustered mutations (kataegis) in vivo in heterologous models where no potential specific cofactors are expected to be present17,18,19,20,21,22,23.

Based on DNA sequence context and other approaches, it has been shown that AID, which generates mutations of C-G pairs in the WRC/GYW motif (Fig. 1, upper row, mutated base pair is underlined), could contribute to gastric and haemopoietic cancers24,25, whereas APOBEC3A (A3A) and A3B (TCW/WGA motif, Fig. 1, fourth row) potentially contribute to breast, lung and many other cancers17,26,27,28. A recent report indicates that deaminase-induced clusters of mutations mark signatures of accelerated somatic evolution in cancer gene promoters in lymphoma29. Mutational signatures are critical in the analysis and are subject to continuous refinement30. Here we describe a novel, unexpected mutational signature of AID deaminase that is linked to DNA CpG methylation. We initially identified this hybrid signature in follicular lymphoma and then in more than half of the human cancer types analyzed.

Figure 1

Mutable DNA sequence motifs analyzed in this work.

Variants of DNA sequences corresponding to a defined motif (left column, bold) are shown to the right in the double-stranded DNA form. Mutation-prone bases are in red and underlined.

Results and Discussion

We analyzed over 13,000 base substitutions found in follicular lymphoma (FL) in 22 patients (Supplemental Table 1). Mutations at G-C base pairs were 1.5 times more frequent than mutations at A-T pairs; the number of transversions was approximately equal to the number of transitions. The overall pattern of base substitutions in FL has similarities both to the classic distribution of types of changes during spontaneous mutagenesis in humans31 and to somatic hypermutation of immunoglobulin genes7 (Supplemental Fig. 1). However, the FL mutational spectrum showed alterations in the ratios of transversions in G-C pairs, namely a two-fold relative increase in the fraction of G-C to T-A and a two-fold decrease in the fraction of G-C to C-G transversions, which could be a sign of modulation of processes of DNA damage and translesion DNA synthesis at G-C pairs32.

Examination of the DNA sequence context of mutations in FL showed that the bias was caused by a significant excess of substitutions in CpG dinucleotides, with the implication that the mechanism of these mutations is linked to cytosine methylation/demethylation33,34. Briefly, the analysis was performed as follows. We calculated the excess of mutations in specific motifs using the ratio Fm/Fn, where Fm is the fraction of mutations observed in the particular motif, and Fn is the frequency of the motif in the respective DNA neighborhood (defined as a 120 bp DNA sequence window, Supplemental Dataset S1). A 2.3-fold excess of mutations (defined as described in Materials and Methods) in CG/CG dinucleotides was detected (Table 1, row 1). In contrast, there was no association between mutations and the TCW/WGA motif, indicating that APOBEC1 and APOBEC3 are not involved in mutagenesis in FL (Table 1, row 2). Instead, we detected the signatures of AID and of Pol η (Table 1, rows 3–6), which are known as mutators involved in somatic hypermutation (SHM) of immunoglobulin genes at G-C and at A-T base pairs, respectively35. Unexpectedly, however, the most strongly over-represented motif was WRCG/CGYW, which is a combination of the AID motif WRC/GYW and the CpG dinucleotide; in contrast, no connection between WRC/GYW and somatic mutations was found in non-CpG sites when CpG was masked (Table 1, last three rows). Notably, SHM in immunoglobulin genes shows the opposite trend whereby somatic mutations are substantially underrepresented in CpG-containing motifs36. Thus, the mutational process in FL appears to be distinct from the conventional SHM and is likely associated with CpG methylation/demethylation processes. AID deaminates 5-methylcytosine in characteristic AID-target sequence contexts, and the footprint of AID-induced mutagenesis has been found in oncogenes mutated in tumors37. Deamination of methylated cytosines by AID and APOBECs38 is thought to contribute to a variety of genetic and epigenetic processes39,40,41,42, which potentially could be compromised in FL cells, leading to AID-dependent mutagenesis.
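
The Fm/Fn calculation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' program: the function names are hypothetical, only substitutions at C or G are considered, and Fn is assumed to be the fraction of the remaining C/G positions in each window that fall in the WRCG/CGYW motif.

```python
# IUPAC codes used by the motifs in the text
# (W = A/T, R = A/G, Y = C/T, S = C/G, N = any base).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "AT", "R": "AG", "Y": "CT", "S": "CG", "N": "ACGT"}


def matches_at(seq, pos, motif, offset):
    """True if `motif` matches `seq` with motif index `offset` aligned to `pos`."""
    start = pos - offset
    if start < 0 or start + len(motif) > len(seq):
        return False
    return all(seq[start + i] in IUPAC[m] for i, m in enumerate(motif))


def in_wrcg(seq, pos):
    """True if `pos` is the mutable C of WRCG or the mutable G of CGYW."""
    return matches_at(seq, pos, "WRCG", 2) or matches_at(seq, pos, "CGYW", 1)


def enrichment(windows):
    """windows: list of (sequence, index of the mutated base).

    Assumption: Fm and Fn are both computed over C/G positions only.
    Returns (Fm, Fn, Fm/Fn).
    """
    cg_windows = [(seq, pos) for seq, pos in windows if seq[pos] in "CG"]
    fm = sum(in_wrcg(seq, pos) for seq, pos in cg_windows) / len(cg_windows)
    sites = motif_sites = 0
    for seq, pos in cg_windows:
        for i, base in enumerate(seq):
            if base in "CG" and i != pos:  # skip the mutated site itself
                sites += 1
                motif_sites += in_wrcg(seq, i)
    fn = motif_sites / sites
    return fm, fn, fm / fn
```

In a real analysis each sequence would be the 120 bp neighborhood of one substitution; a ratio above 1 indicates that mutations fall in the motif more often than its local frequency predicts.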

Table 1 Association between known mutable motifs and the DNA sequence context of somatic mutations in exomes of follicular lymphoma.

The only deviation from this novel mutation pattern in FL was found in 5′UTRs, where SHM appears to operate in the “standard immunoglobulin mode” (significant correlation of mutation context with WRCH/DGYW and WA motifs, Supplemental Table 2). Although elevated mutagenesis was observed in CpG dinucleotides and in the CGYW motif, as in other gene regions, the two processes did not overlap and the hybrid signature was not detected. The 5′UTRs are known to be preferentially targeted by deaminases in active genes43,44,45; therefore, the hybrid motif might be masked by numerous mutations induced by AID and other deaminases.

We analyzed AID-related WRC/GYW and WRCG/CGYW motifs for 22 individual FL patient exomes (Supplemental Table 3). A significant excess of both motifs was found for 13 patients. This finding suggests that the mutational processes associated with AID are active in FL to the extent detectable with sensitive statistical tests in samples with a limited number of mutations. To determine whether the observed excess of WRCG/CGYW motifs could be a simple consequence of an extremely high mutability of CpG dinucleotides, we compared the relative frequencies of mutations in the WRCG/CGYW motifs and in CpG-containing contexts that do not contain the WRC/GYW motif, namely YCG/CGR and SNCG/CGNS, in different cancer cell lines. In FL and in many other cancers, there was a highly significant excess of mutations in WRCG/CGYW compared to the motifs lacking WRC (Table 2), indicating that the overlap of the AID motif and CpG is indeed a unique mutagenesis signature. In a diverse collection of cancer genomes, we found a significant excess of WRCG/CGYW motifs in two distinct types of blood cancer with the highest representation in the COSMIC data set, as well as in 9 out of 14 analyzed solid tumors from various tissue types, particularly in stomach cancer. Among tissues without an excess of mutated WRCG/CGYW motif, skin has an exceptionally low rate of mutations in this motif, consistent with the previous observations that a different motif (YCG/CGR) is hypermutated in human skin cancers46,47. Importantly, the signatures characteristic of AID activity are detectable specifically in cancer genomes. As a control, we examined the context of somatic mutations in various normal tissues48 and did not find any significant excess of AID-related mutable motifs, either CpG-containing or not (Supplemental Tables 4 and 5). The size of these datasets is limited, but power analysis (Materials and Methods) suggested that the absence of any significant excess of AID-related mutable motifs likely reflects genuine biological properties of these samples.

Table 2 Difference between the mutability of AID motifs WRC/GYW with vs without an extra 3′ GC pair (Fig. 1) in various cancer genomes (the Sanger COSMIC Whole Genome Project)*.

The striking abundance of mutations in WRCG/CGYW motifs in tumors implies that AID is sufficiently active in many human cancer types to skew the mutation distribution towards the AID WRC/GYW motifs. These observations are in line with the previous findings on the involvement of AID in gastric cancers25 and the growing evidence on the role of AID in CpG demethylation in some genomic regions40,49. We analyzed the mutability of WRC/GYW motifs in various cancer genomes from COSMIC and observed that almost half of the cancer types (6 of the 16) show a significant excess of mutations in these motifs (Table 3). The high mutation prevalence in the “pure” AID motif strongly correlates with that in the hybrid “AID and CpG” motif across the range of cancers. However, the apparent correlation is not perfect and the excess of mutations in WRC/GYW is generally weaker (Fig. 2). The cancers without an excess of mutations in WRCG/CGYW (breast, bladder, cervix, lung, skin) show no increased mutability of the WRC/GYW motif either. The difference in the mutability patterns between the two motifs can be explained in part by the greater statistical power of the more informative WRCG/CGYW motifs compared to WRC/GYW motifs. When the involvement of AID is not supported at a statistically significant level through the WRC/GYW motif, it might still act at CpG dinucleotides, causing a significant deviation from the expected mutation frequencies for the WRCG/CGYW motif.

Table 3 Preferential mutability of WRC/GYW and somatic mutations in various cancer Whole Genomes and Whole Exomes (the Sanger COSMIC Whole Genome Project)*.
Figure 2

Tumor types with mutation enrichment in the hybrid AID/CpG motif tend to possess an excess of mutations with the pure AID signature.

We next compared the expression levels of the AICDA gene, which encodes AID, between the TCGA cohorts. Quartiles and extrema were calculated for each TCGA cohort selected in the study (Supplementary Fig. 2). The observed high variability in AICDA gene expression in B-cell lymphoma (DLBC) is on par with the observation of widely varying levels of AICDA expression in peripheral blood mononuclear cells of patients with B-CLL50. The expression levels in all other tumor tissues are within the range where definitive conclusions cannot be made based on the data currently available in TCGA (Supplementary Fig. 2). In most tumor cohorts, however, the quantitative profile of the expression values represented by the five-number summary (and especially the high variability of AICDA expression; see Supplementary Fig. 2) closely follows that of B-cell lymphoma, which is consistent with the hypothesis presented here.

We next analyzed mutations and the overall level of methylation (% of methylated cytosines or methylation ratio) for 26 patients with malignant lymphoma (https://dcc.icgc.org/projects/MALY-DE, see Methods for details). Consistent with our previous findings (Tables 1 and 3), there is a substantial excess of mutations in WRCG/CGYW and WRC/GYW motifs (4.91 times and 1.53 times, respectively, P < 10^−10 for both motifs). Analysis of the relative frequencies of mutations in the WRCG/CGYW motifs and in CpG-containing contexts that do not contain WRC/GYW, namely YCG/CGR and SNCG/CGNS, also revealed a highly significant excess (1.5 times, P < 10^−10) of mutations in motifs containing AID-mutable WRC/GYW, indicating that the overlap of the AID motif and CpG is indeed the signature of the mutation process in malignant lymphoma, similar to other blood cancers (Table 2). Examination of the association between the methylation ratio and somatic mutations in WRCG/CGYW mutable motifs identified a moderate but significant decrease of methylation in the WRCG/CGYW mutation context. The mean methylation ratios for the WRCG/CGYW mutation positions and non-CGYW mutation positions (YCG/CGR and SNCG/CGNS) were 74.8 and 79.4, respectively (P < 0.0001 according to the sampling test; see Methods for details). The histogram in Fig. 3 shows that the major difference is within the range of methylation ratios between 80 and 100, i.e. in mutation positions with large methylation ratios. This finding is consistent with the hypothesis that AID-dependent demethylation preferentially occurs in WRCG/CGYW mutable motifs so that mutations are one of the outcomes of the multistep demethylation process37. No significant difference between the WRCG/CGYW mutable motifs and non-WRCG/CGYW contexts was found for all genomic positions without taking into account somatic mutations in the same set of methylated CpGs (https://dcc.icgc.org/projects/MALY-DE, mean values of the methylation ratio are 73.9 and 74.6, respectively), although the slight overall decrease in the methylation ratio in WRCG/CGYW motifs might have biological implications. These findings are compatible with the hypothesis that AID is involved in demethylation of methylated cytosines during cancer initiation and/or progression.

Figure 3

The methylation ratio in WRCG motifs and non-WRCG motifs (YCG/CGR and SNCG/CGNS motifs).

The fraction of motifs in each bin (0–20% methylation ratio, 20–40% methylation ratio, etc.) is shown.

The analysis of mutations in cancer genomes presented here shows a cancer-specific AID mutational signature that overlaps with the CpG dinucleotide. Thus, AID mutagenesis linked with methylation/demethylation of CpG appears to be a widespread phenomenon in human cancers. The specific mechanisms of the interaction between CpG (de)methylation and AID-mediated mutagenesis remain to be elucidated. The broader implication of these findings is that epigenetic effects can be directly relevant for somatic mutagenesis in many if not most cancers.

Methods

The exome sequencing data of 22 follicular lymphoma patients were described previously51. DNA sequences surrounding the mutated nucleotide represent the mutation context. We compared the frequency of known mutable motifs for somatic mutations with the frequency of these motifs in the vicinity of the mutated nucleotide. Specifically, for each base substitution the 120 bp sequence centered at the mutation was extracted (the DNA neighborhood). We used only the nucleotides immediately surrounding mutations because AID/APOBEC enzymes are thought to scan a limited area of DNA to deaminate (methyl)cytosines in a preferred motif26. This approach does not exclude any given area of the genome in general, but rather uses the areas within each sample where mutagenesis has happened (taking into account the variability in mutation rates across the human genome), and then evaluates whether the mutagenesis in this sample was enriched for AID/APOBEC motifs26. This approach was thoroughly tested and a high accuracy of the analysis was shown26. The frequency of mutable motifs in the positions of somatic mutations was compared to the frequency of the same motifs in the DNA neighborhood (Fig. 1) using the Fisher exact test (2 × 2 table, 2-tail test) and a Monte Carlo test (MC, 1-tail test) as previously described52,53,54 (for details see Supplementary Fig. 3). Somatic mutation data from the ICGC and TCGA cancer genomic projects were extracted from the Sanger COSMIC Whole Genome Project v75, downloaded from http://cancer.sanger.ac.uk/wgs. The tissues and cancer types were defined according to primary tumor site and cancer projects. Somatic mutations in various normal tissues were taken from ref. 48 (Supplementary Table 5).
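
The two-tailed Fisher exact test on a 2 × 2 table can be illustrated with a short pure-Python sketch. The function name and the table layout (mutated versus unmutated sites, inside versus outside the motif) are assumptions made for illustration and may differ from the authors' ad hoc programs. The Monte Carlo test is not sketched here because its details are given only in Supplementary Fig. 3.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher exact P value for the table [[a, b], [c, d]].

    Assumed layout (hypothetical):
        a = mutated sites in the motif,   b = mutated sites outside it,
        c = unmutated motif sites,        d = unmutated non-motif sites.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of a table whose top-left cell is x.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at least as unlikely as the observed one;
    # the small tolerance guards against floating-point ties.
    p_value = sum(p for p in map(prob, range(lowest, highest + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)
```

For example, `fisher_exact_two_tailed(30, 70, 500, 4500)` would test whether 30 of 100 mutations falling in a motif is unexpected when the motif occupies 500 of 5,000 unmutated neighborhood sites (illustrative numbers, not data from this study).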

We compared the magnitude of the difference between the fraction of mutations observed in the mutable motif and the fraction of motifs in the surrounding region (effect size) for somatic mutations in normal tissues. For the purpose of this comparison (power analysis), we used a sampling procedure that was repeated 1,000 times. Each sample of somatic mutations from blood and stomach cancers (where a significant excess of somatic mutations in WRC/GYW motifs was observed, Tables 2 and 3) had a size equal to that for the corresponding normal tissue (674 for blood and 49 for stomach, Supplementary Table 5). Analysis of the difference between the fractions showed that the difference for normal mutations was smaller for 98.3% of blood cancer samples and for 94.7% of stomach cancer samples. Thus the observed effect size (Supplementary Table 5) is likely to reflect biological properties of these samples and is unlikely to be a result of the small sample size, at least for somatic mutations from blood and stomach.
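
The subsampling procedure used for the power analysis might look like the following sketch. It rests on simplifying assumptions that are not stated in the text: each cancer mutation is reduced to a flag saying whether it falls in the motif, the motif frequency in the surrounding region is treated as a fixed number, and sampling is done without replacement. All names are hypothetical.

```python
import random


def fraction_exceeding(cancer_flags, background_fraction, normal_effect,
                       sample_size, repeats=1000, seed=0):
    """Fraction of cancer subsamples whose effect size exceeds the normal one.

    cancer_flags: list of booleans, True if a cancer mutation is in the motif.
    background_fraction: assumed fixed motif frequency in the neighborhood.
    normal_effect: effect size (Fm - Fn) observed for the normal tissue.
    sample_size: number of mutations in the normal-tissue dataset.
    """
    rng = random.Random(seed)
    exceeding = 0
    for _ in range(repeats):
        sample = rng.sample(cancer_flags, sample_size)  # without replacement
        effect = sum(sample) / sample_size - background_fraction
        if effect > normal_effect:
            exceeding += 1
    return exceeding / repeats
```

A value close to 1 means that cancer samples of the same size as the normal dataset almost always show a larger effect, so the weak effect in normal tissue is unlikely to be explained by sample size alone.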

For the AICDA gene expression analysis, the normalized RSEM values (Broad Institute TCGA Genome Data Analysis Center (2016) Analysis-ready standardized TCGA data from Broad GDAC Firehose 2016_01_28 run. Broad Institute of MIT and Harvard. Dataset. http://doi.org/10.7908/C11G0KM9) were used to analyze the TCGA RNA-Seq datasets from the Broad Genome Data Analysis Center. For each TCGA cohort (Supplementary Fig. 2), the lower and upper bounds, median, outliers, and first and third quartiles were retrieved via the FireBrowse RESTful API (http://firebrowse.org/api-docs/) for the tumor and the corresponding normal (when available) tissue samples.

For the analysis of the association between somatic mutations, mutable motifs (WRCG/CGYW) and methylation, datasets for 26 patients with malignant lymphoma (https://dcc.icgc.org/projects/MALY-DE) were used. In the analyzed datasets, the data for all patients were pooled together (the Supplemental Dataset S2 contains the studied set of somatic mutations). Each position is characterized by the methylated/unmethylated read count and the methylation ratio (the number of methylated reads divided by the total number of reads overlapping this position and multiplied by 100). Only positions with more than nine associated reads were included in the analysis. The mean values for mutation positions with (M1) and without (M2) WRCG/CGYW mutable motifs (3,620 and 11,003 positions, respectively) were calculated. To compare the difference between these two types of positions, methylation ratio values from the larger dataset were randomly sampled until the number of positions was the same as in the smaller dataset. For each sampled dataset, the mean value (M2_sampled) was calculated and the probability P(M1 ≥ M2_sampled) was calculated from 10,000 sampled datasets. The same sampling procedure was used for all genomic positions without taking into account positions of somatic mutations. Code availability: A set of ad hoc programs is available upon request from Igor B. Rogozin (rogozin@ncbi.nlm.nih.gov).
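
The methylation ratio and the sampling test described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names are hypothetical, and sampling without replacement is an assumption.

```python
import random
from statistics import fmean


def methylation_ratio(methylated, unmethylated, min_reads=10):
    """Percent of methylated reads, or None if the position has too few reads.

    "More than nine associated reads" is taken to mean at least 10 in total.
    """
    total = methylated + unmethylated
    if total < min_reads:
        return None
    return 100.0 * methylated / total


def sampling_test(smaller, larger, n_samples=10_000, seed=0):
    """Estimate P(M1 >= M2_sampled).

    smaller: methylation ratios of the smaller set (mean M1).
    larger: methylation ratios of the larger set, subsampled to len(smaller).
    """
    rng = random.Random(seed)
    m1 = fmean(smaller)
    hits = sum(m1 >= fmean(rng.sample(larger, len(smaller)))
               for _ in range(n_samples))
    return hits / n_samples
```

With the WRCG/CGYW mutation positions as `smaller` and the non-WRCG/CGYW positions as `larger`, a small returned value indicates that the mean methylation ratio in WRCG/CGYW positions is rarely as high as that of an equally sized sample of the other positions.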

Additional Information

How to cite this article: Rogozin, I. B. et al. Activation induced deaminase mutational signature overlaps with CpG methylation sites in follicular lymphoma and other cancers. Sci. Rep. 6, 38133; doi: 10.1038/srep38133 (2016).
