Introduction

Gene expression profiles have been used extensively in the study of cancer development, treatment response and prognosis. In particular, gene expression signatures have been developed to classify tumour subtypes [1, 2], predict response to endocrine treatment [3, 4], indicate prognosis [3, 5, 6] and predict tumour recurrence [4]. The development of these signatures has relied on capturing a change in average expression between biological groups (e.g. poor responders versus good responders) and subsequently validating these results in independent datasets. The variability in gene expression levels (e.g. across samples) is often ignored and is generally only considered in terms of the impact it has on statistical power. However, gene expression variability has been shown to be under genetic control and important to cellular function [7,8,9,10,11]. Gene expression variability has been used to evaluate transcriptomic data in human disease and development [12,13,14,15]. By isolating individual embryonic cells, researchers have shown that gene expression variability provides insights into gene regulation that is essential throughout embryonic development [12]. Differences in gene expression variability have also been investigated to improve our understanding of cancer biology. Although a few studies have been reported to date, this approach has led to the identification of a pan-cancer gene set [16], a classifier for chronic lymphocytic leukaemia [17] and synthetic lethal genes in BRCA2-associated ovarian tumours [18]. These studies measured gene expression variability of whole-transcriptome data generated from microarray or RNA-sequencing platforms. Two of these studies highlighted the utility of measuring global gene expression variability from microarray data to classify leukaemia subtypes [17] and to identify a 48 gene set as a predictive marker of cancer metastatic potential and patient survival [16]. One study identified genes from RNA-sequencing data with potential synthetic lethal interactions with BRCA2 in ovarian cancer [18]. Genetic variants in BRCA1 and BRCA2 predispose women to breast and ovarian cancers and are believed to contribute to 5–10% of all breast cancers and 20–40% of familial breast cancers [19, 20]. Despite representing distinct tumour types, there has been limited success in identifying gene expression profiles related to BRCA1 and BRCA2 pathogenic variants in either tumour [3, 21,22,23,24] or normal tissue [25,26,27,28]. Across all studies, consensus on altered genes and pathways has been poor, with gene expression profiles influenced by study design rather than variant classification [29]. For example, early studies were confounded by differences in oestrogen receptor (ER) status of BRCA1-associated tumours compared to sporadic [21]. The ER status of breast tumours was later identified to be a major driver of gene expression changes [3]. These studies have also overlooked the variability in gene expression and whether these phenotypes are associated with the presence of a pathogenic variant.

In this study, we investigate transcriptomic data across multiple breast tumour datasets using differential variability analysis to identify genes that are associated with BRCA1 and BRCA2 pathogenic variant status and with different tumour subtypes. We also utilised RNA in situ hybridisation as an orthogonal approach to validate inter-tumour expression variability of a candidate gene in an independent cohort of breast tumours.

Methods

Data collection

For gene expression variability analyses, we included any microarray dataset containing gene expression profiles on at least 50 breast tumours, of which > 25 samples were either BRCA1- or BRCA2-associated breast tumours (Table 1). Raw data were acquired through GEO (https://www.ncbi.nlm.nih.gov/geo/) for three datasets (GSE19177, GSE27830 and GSE49481) generated on Illumina, Affymetrix and Agilent microarray platforms, respectively [23, 30, 31]. Raw intensities collected on the Illumina and Agilent arrays were normalised using quantile normalisation. Additionally, for the Agilent array only the intensity values (CY5 channel) corresponding to breast tumour RNA were considered. Raw intensities obtained on the Affymetrix arrays were normalised using the RMA algorithm [32].

Table 1 Breast cancer datasets

In addition, a retrospective microarray dataset of well-curated breast cancer gene expression profiles [33] was utilised for a subtype-centric analysis [33]. Briefly, all samples were profiled on one of the Affymetrix U133A, U133A2 or U133plus2 GeneChip arrays. For consistency, only probes common to all arrays were retained. Individual arrays were normalised using the RMA method and batch effects were corrected using the COMBAT method [34]. The resulting dataset consisted of 2116 breast tumours and 22,268 probes.

Lastly, we accessed the METABRIC dataset through cbioportal (www.cbioportal.org/), in order to investigate DV genes relationship with molecular features.

Transcriptome-wide gene expression variability analysis

Transcriptome-wide gene expression variability was compared between BRCA1-associated tumours and non-BRCA1/2 (BRCAx) hereditary breast tumours, BRCA2-associated tumours and BRCAx and Basal-like and non-basal (luminal A, luminal B, HER2 and normal-like) tumours. To assess transcriptome-wide changes, tumour type-specific standard deviation (SD), coefficient of variance (CV) and median absolute deviation (MAD) were calculated for each gene. Pairwise linear regression models were calculated between tumour groups for each gene-specific statistics (i.e. SD, CV, MAD and mean) and the resulting β (slope coefficient) was compared to a model of tumour equity (β = 1). We defined the difference in expression variability between tumour groups as the percentage change in the slope coefficient compared to model of tumour equity (1 – β × 100). Additionally, pairwise polynomial regression was performed to investigate non-linear relationships between tumour-specific statistics.

Differential expression variability analysis

Genes were considered to exhibit differential expression variability (hereafter referred to as differentially variable—DV-genes) if the spread of expression values differed significantly between two tumour groups. For these analyses, the same tumour comparison was made as in the transcriptome-wide analyses (i.e. BRCA1 versus BRCAx, BRCA2 versus BRCAx and basal-like versus non-basal). Robust measures of spread were used to avoid spuriously large differences caused by outlying expression values. Spread was quantified by the median absolute deviation (MAD) and groups were compared by the Brown–Forsythe method [35], essentially an ANOVA on expression values about their medians, with p-values adjusted for multiple testing [36] was used to determine significance. Ratios of MADs with 95% confidence intervals [37] were used to quantify changes in spread between groups.

Breast cancer tissue microarrays

Women diagnosed with breast cancer were identified from 1800 families recruited into the Kathleen Cunningham Consortium for Research into Familial Breast Cancer (kConFab) [38]. For inclusion in this consortium, families must have a strong family history of breast and/or ovarian cancer, or be known to be segregating a germline variant in genes such as BRCA1 and BRCA2 (see www.kconfab.org for recruitment criteria). Breast cancer cases on tissue microarrays (TMAs) were verified from pathology reports. Ethics approval was obtained from the HREC at the Peter MacCallum Cancer Centre (97/27) and through the University of Otago Human Ethics Committee (H14/088). Informed consent at study entry was obtained from all participants, allowing access to medical/treatment reports, blood collection and tumour tissue collections. For deceased participants proxy consent was obtained from the next of kin. Where applicable, cause of death was verified from a death certificate, doctor or hospital medical records. Treatment and medical notes were accessed through physicians, hospitals, laboratories and State Cancer Registries.

Confirmation of a participant’s germline mutation status was performed using a variety of sequencing platforms in the Molecular Pathology NATA-accredited clinical laboratory. Variants were assigned a class C4–C5 (pathogenic) mutation status according to a 5-tier clinical classification introduced by ENIGMA (http://www.enigmaconsortiumorg).

RNA in situ hybridisation (RNAscope)

The mRNA expression level for EN1 was investigated using the RNAscope 2.0 BROWN assay (Advanced Cell Diagnostics, Inc.) following manufacturers’ instructions. Additionally, mRNA levels of PPIB and the bacterial gene dapB were used as positive and negative controls, respectively. Briefly, TMA sections were deparaffinised in a series of xylene and 100% ethanol steps. Sections were subjected to a series of pre-treatments before each section was incubated with target or control probes for 2 h at 40 °C in a HybEZ™ Oven. Probe signal was amplified through a series of amplification steps and colour development was done using diaminobenzidine (DAB).

Scoring/counting signals

All TMA cores were scored for abundance of PPIB and DapB signals where a positive signal was assessed as a brown punctuate dot within a cell. Supplementary Table 1 describes the criteria for each score. Briefly, a score of ‘0’ described an absence of signal, ‘0.5’ was a weak stain with < 30% of cells having a signal, ‘1’ was a modest stain with > 30% of cells with a positive signal, ‘2’ was considered moderate staining with a greater number (4–9) of signals per cell, ‘3’ was strong staining with the presence of signal clustering in < 10% of cells and ‘4’ was intense staining with signal clustering in > 10% of cells.

All TMA sections were scanned with the Aperio Scanner and digital scans were used to quantify mRNA signals using the LEICA RNA ISH algorithm (Leica Microsystems GmbH, Germany). Briefly, the algorithm was trained on a selection of cores from each TMA and across a range of genotypes. These settings were applied to all the cores stained with probes targeting EN1 RNA.

Transcription regulator enrichment analysis

To identify DNA binding elements that were overrepresented in a set of DV genes, the TFEA.ChIP package in R was used [39]. ChIP-Seq datasets were obtained through the TFEA.ChIP github page (https://github.com/LauraPS1/TFEA.ChIP_downloads). Briefly, datasets from the ENCODE Consortium, DeMap and individual GEO database were included in the analysis, and gene sequences were annotated based on transcription regulator binding within 1 kb of each gene.

Statistical analysis

R (version 3.6.1) was used to normalise microarrays and perform all statistical analyses. For microarrays on the Affymetrix platform, the RMA normalisation [32] from the affy package was applied, whilst for each of the Agilent and Illumina arrays, the quantile normalise method from the limma package was used [40]. The lawstat package was used to calculate Brown–Forsythe test. To account for multiple testing, p-values were adjusted using the Benjamini–Hochberg procedure [36].

Results

Transcriptome-wide gene expression variability

Differences in transcriptome-wide gene expression variability between familial breast cancer groups (BRCA1, BRCA2 and BRCAx) were quantified by comparing tumour-specific measurements of variability (SD, CV and MAD). BRCA1-related breast tumours share multiple biological properties with sporadic basal-like breast cancer [41], therefore we also assessed gene expression variability differences between the basal and non-basal breast tumour subtypes. BRCA1-associated and basal-like breast tumours displayed greater transcriptome-wide variability compared to non-BRCA1 and non-basal-like (luminal A, luminal B, HER2 and normal-like) tumours (Fig. 1, Supplementary Fig. 1). Using gene-specific standard deviations, BRCA1-associated breast tumours had a 22.8% (95% CI 22.3–23.2) increase in gene expression variability. Increased variability in BRCA1-associated breast tumours was also observed in linear models of gene-specific CVs (25.0%, 95% CI 24.5–25.6) and MADs (32.4%, 95% CI 31.9–33.0). Similarly, basal-like tumours had an average 28.2% (95% CI 27.7–28.7) greater transcriptome-wide expression variability compared to non-basal tumours. Gene expression variability in BRCA2-associated tumours was inconsistent, with the Waddell dataset showing comparable variability with BRCAx tumours (Supplementary Fig. 2), but the Larsen dataset showing only a modest 11.1% (95% CI 10.6–11.6) increase in variability in BRCA2-associated tumours compared to BRCAx tumours. The Nagel dataset had too few BRCA2-associated tumours to reliably estimate transcriptome-wide expression variability. In contrast to gene expression variability, no changes were observed in transcriptome-wide mean expression for any tumour groups tested.

Fig. 1
figure 1

Transcriptome-wide gene expression variability in breast tumours. BRCA1-associated and basal-like breast tumours each show greater gene-specific standard deviations compared to BRCAx and non-basal tumour, respectively. Additionally, global gene-specific means between tumour groups are depicted. A model of equity (red line) was compared to the linear model (blue dashed line) and polynomial regression (sky blue line)

Differential expression variability in breast tumours

As BRCA1-associated and basal-like breast tumours displayed greater transcriptome-wide expression variability, it was of interest to identify specific genes that had altered variability between tumour types. After adjusting for multiple comparisons, 503 and 337 genes were found to be significantly differentially variable between BRCA1- and BRCAx-associated breast tumours in the Nagel and Larsen datasets, respectively. There were 40 DV genes that were consistent in both the Nagel and Larsen datasets (Table 2). The directionality of the gene expression variability was consistent for all 40 DV genes. Additionally, 36 of the 40 BRCA1-associated DV genes were also DV between BRCA1-associated and sporadic breast tumours. A total of 185 basal-like-associated DV genes were identified across the four breast tumour datasets, with 184/185 (99.5%) consistent in direction. Analysis of the meta-cohort data, which comprised 2116 samples, found 58% of all genes to be DV. To identify candidate genes for in situ expression analysis, we compared the top 1.5% most significant DV genes across all datasets and between subtype and genotype analyses (Fig. 2A). A total of 10 BRCA1-associated and 22 basal-like-associated variable genes were identified in all datasets at 1.5% threshold. EN1 and IGF2BP3 were variably expressed in both BRCA1-associated and basal-like breast tumours (Fig. 2B).

Table 2 40 consensus significant BRCA1-associated variable genes
Fig. 2
figure 2

Identification of BRCA1-associated candidate gene(s). A Intersection between the consensus BRCA1-associated variable genes (green) and basal-like-associated variable genes (orange). B Significances of differential variability genes that intersected both analyses from A. C EN1 (left) and IGF2BP3 (right) expression in the METABRIC dataset stratified by intrinsic subtype. p-values were calculated by the Brown–Forsythe test comparing basal-like tumours to all others. D Correlation of EN1 (left) and IGF2BP3 with ESR1 expression and ER status (insert). All p-values were estimated using the Brown–Forsythe test

EN1 gene expression variability

We used the publically available METABRIC dataset [42] to further interrogate EN1 gene expression in 1904 breast tumours (Fig. 2C, D). Consistent with analysis of the four breast cancer microarray datasets, basal-like breast tumours had significantly greater EN1 expression variability (Fig. 2C, p = 3.7 × 10−146). Additionally, as BRCA1-associated breast tumours are typically oestrogen receptor (ER) negative we explored the correlation with ESR1. ER negative tumours had significantly greater expression variability (Fig. 2D, Brown–Forsythe test p = 2.19 × 10−106) and tumours with low ESR1 expression had a greater range of expression. Lastly, to determine if gene dosage was the cause of variable expression we measured the correlation of copy number with EN1 expression. There was no correlation between copy number and gene expression (r2 = 0.026) or expression variability (r2 = 0.023, Supplementary Fig. 3).

To assess intra-tumoural expression of EN1, breast tumour tissue microarray cores from 503 patients were assayed using RNA in situ hybridisation. These samples included tumour tissue from 151 BRCA1-associated and 124 BRCA2-associated breast cancer cases (Supplementary Tables 2, 4). Tissue microarrays were stained and scored for the abundantly expressed PPIB and a negative control targeting dapB. Tumour cores with scores > 2 for the PPIB and 0 for dapB were considered high quality and used to investigate EN1 expression. Two-hundred-and-thirteen tumours had positive PPIB mRNA signals with 141/213 tumours being scored 2 or greater for PPIB staining. No dapB signal was detected in any tumours.

The LEICA RNA ISH Algorithm was used to estimate the abundance of RNA signals and the percentage of positively stained cells. Consistent with our transcriptome analysis, in situ expression analysis of BRCA1-associated tumours showed significantly greater EN1 expression variability (Fig. 3, p = 6.7 × 10−04).

Fig. 3
figure 3

EN1 RNAscope® of breast tumour cores. A Representative images of BRCA1-, BRCA2- and non-BRCA1/2-associated breast tumours stained for EN1 and a section stained with the positive control probe (PPIB). B Percentage of EN1-positive tumour cells in each familial tumour type. A significant difference (Brown–Forsythe test) in EN1 expression variability was observed between BRCA1-associated and BRCAx tumours

Transcription regulators enrichment analysis

To identify transcription regulators that may be responsible for increased BRCA1-associated expression variability, we used the TFEA.ChIP package in R to identify transcription regulators overrepresented in the 40 BRCA1-associated DV genes (Table 2). Firstly, we identified individual ChIP-seq datasets associated with BRCA1-associated DV genes (Fig. 4) and then summarised these results at the transcription regulator level in order to identify transcription regulators that are enriched or depleted within BRCA1-associated DV genes (Supplementary Table 3). The top ranked ChIP-seq dataset in human mammary epithelial cells (HMEC) was associated with EZH2 as shown in (Supplementary Table 3). Further, EZH2 was the top ranked transcription regulator associated with BRCA1-related DV genes (Supplementary Table 3). Interestingly, EZH2 was significantly overexpressed in BRCA1-associated and basal-like breast tumours (Fig. 5).

Fig. 4
figure 4

Transcription regulators enrichment analysis for BRCA1 DV genes. Each dot represents significant over/underrepresentation of a single ChIP-Seq dataset. Log odds ratio (OR) > 0 indicates overrepresentation of BRCA1 DV genes in a ChIP-Seq dataset. Polycomb repressive complex 2 components, EZH2 (blue) and SUZ12 (yellow) are associated with BRCA1 DV genes

Fig. 5
figure 5

EZH2 expression in breast tumours stratified by basal-like molecular subtype (A) or BRCA1 (B) status. The difference of expression was calculated using the Welch t test

Discussion

The robust and replicable nature of gene expression measurements provides an excellent opportunity to investigate biological variability [11]. We explored publicly available familial breast cancer microarray datasets for phenotypes associated with BRCA1- and BRCA2-related breast tumours. In addition, a cohort of 2116 breast tumours profiled on Affymetrix microarray platforms was included to identify gene expression variability in basal-like tumours. In this study, we identified that BRCA1-associated and basal-like breast tumours displayed greater gene expression variability compared to non-BRCA1 and non-basal tumours, respectively. These observations were consistent across all microarray datasets explored and across three different measures of variability. Furthermore, no significant transcriptome-wide changes in mean expression were observed between any tumour groups. Together these results suggest that globally expression variability associates with phenotypic features of BRCA1-associated cancer to a stronger degree than mean expression and at an individual gene level, associations may not be statistically significant if only mean expression level is examined.

A number of studies have used transcriptome variability in humans to described differences in disease and development states [12, 14, 16,17,18]. One study by Bueno and Mar [18] has explored gene expression variability to identify synthetic lethal genes associated with BRCA2 loss-of-function ovarian tumours. The authors proposed that 54 stably expressed (low variable) genes in BRCA2-related tumours may be essential in BRCA2-related tumour viability. However, there was no formal statistical test that estimated the association of gene expression variability. Our analysis in BRCA2-associated breast tumours suggested that no genes were differentially variable. However, it is possible that the small number of BRCA2 samples in the microarray datasets hindered the identification of BRCA2-associated DV genes.

Transcriptome-wide gene expression variability may be driven by tumour heterogeneity, aggressiveness and cellular content. For example, more aggressive lymphocytic leukaemias are associated with greater gene expression variability [17], an observation that is consistent with the greater gene expression variability seen in BRCA1-associated and basal-like breast tumours. Furthermore, the epigenetic status of cells is a heritable trait that also has significant variability [43]. Specifically, DNA methylation can contribute to gene expression variability [44] and the methylation status of tumour cells may influence gene expression variability described in this study. Consistent with this, the top ranked transcription regulator associated with 40 BRCA1 DV genes was EZH2, a component of the Polycomb repressive complex 2 (PRC2). PRC2 is important for H3 lysine 27 trimethylation (H3K27me3) and the stable repression of transcription [45]. EZH2 has been shown to be overexpressed in a number of tumours including BRCA1-deficient breast cancers [45, 46]. In addition, loss of BRCA1 function and the decrease in BRCA1 expression can alter the occupancy of EZH2 on DNA and increase H3K27me3 levels [47]. These BRCA1-related changes in epigenetic regulation are potential mechanisms that may alter transcriptional control, ultimately leading to increased cellular gene expression variability.

Technical variation is expected to contribute randomly to each sample; however, as there was no standardised protocol between and within the datasets used, it is plausible that processes such as sample collection and RNA extraction may influence variability. By considering only genes that were variably expressed across all microarray datasets we are able to reduce false-positive associations induced by these types of experimental artefacts. Our approach identified two genes (EN1 and IGF2BP3) that had increased variability in BRCA1-associated breast tumours. EN1 encodes the transcription factor Engrailed Homoeobox 1 and has been extensively investigated in neuronal development. Ectopic EN1 expression improves neurons’ survival and protects against apoptosis [48]. Conversely, knockdown of expression in dopaminergic neurons has been shown to induce apoptosis [49]. EN1 has been observed to be overexpressed in triple-negative breast cancers and basal-like breast cancers [50,51,52]. SDs and inter-quartile ranges reported by these studies were suggestive of increased gene expression variability similar to that seen in this study; however, these were not formally tested. Additionally, EN1 expression has been associated with poorer survival and a greater probability of brain metastases. Interestingly, reduced expression of EN1 in basal, but not luminal, breast cancer cell lines decreases viability, which can be partially rescued by overexpression of EN1 [52]. The evaluation of EN1 protein by immunohistochemistry has been perplexing, with Kim and colleagues previously describing EN1 protein expression as being associated with improved survival in triple-negative breast cancer, opposite to that of EN1 RNA [51]. The discrepancy may in part be due to poor antibodies targeting EN1 [52] or post-transcriptional processes. In this study, we have investigated the utility of RNAscope to measure gene expression in situ and we were able to recapture the inter-tumour variability observed in the microarray analysis. The implementation of RNA in situ hybridisation can overcome the issue of antibody specificity and may help facilitate replication of survival trends. However, the lack of survival data limited our ability to formally test for associations. Currently, measurement of intra-tumoural variability remains challenging, particularly due to the discrimination of signal in dark-stained chromatin and granulated nuclei. The development of more sophisticated algorithms in the future may provide greater power to assess variability within individual sections and allow the ability to test for associations with clinicopathological data. Furthermore, to fully appreciate tumour variability, complete tumour sections may be required for future studies.

Conclusion

BRCA1-associated and basal-like breast tumours displayed a phenotype of greater gene expression variability, with no change in global RNA abundance. EN1 had greater expression variability in BRCA1-associated breast tumours, and this was captured in transcriptomic and RNA ISH analyses. We conclude that the expression variability of a gene is replicable, thereby laying a foundation for future studies aiming to better understand the molecular mechanisms underlying the development of basal-like breast tumours.