Introduction

Breast and prostate cancers are among the most common cancers in Western populations. Twin studies indicate that inherited genetic factors contribute to the development of ∼27% of breast and 42% of prostate cancer,1 and both cancers show two-fold elevated risks of the same disease in first-degree relatives of patients.2

To date, no high penetrance susceptibility loci have been identified for prostate cancer and most of the inherited risk is considered to be a consequence of the co-inheritance of multiple low-risk variants. Although high-penetrance susceptibility to breast cancer in the form of BRCA1 and BRCA2 mutations is well recognised, classical mutations in these genes only account for ∼20% of familial disease.2

Recent genome-wide association (GWA) studies of breast and prostate cancers have identified several common single-nucleotide polymorphisms (SNPs) at different loci that influence the risk of these tumours.3, 4, 5 These risk variants are common in the general population (minor allele frequency (MAF)>5%), and are associated with odds ratios typically <1.3. Despite the relatively small predisposing effects that they individually confer on tumour risk, they can have more profound effects by functioning in concert. Moreover, their identification has provided important and novel insights into the biology of both breast and prostate cancer.

The majority of cancer predisposition genes that have to date been identified through GWA studies function in a co-dominant manner, and studies have found no good evidence for recessively functioning disease loci. Although this may be reflective of the biology, it may also be a consequence of GWA studies having suboptimal ability to detect recessively functioning disease alleles. Clues that tumour susceptibility may have a recessive basis come from some, but not all, studies showing an increased incidence associated with consanguinity and in populations characterised by a high degree of inbreeding.6, 7, 8, 9, 10, 11 Further evidence for the role of homozygosity in cancer predisposition is provided by experimental animal inbreeding (eg, backcrossing mice) being associated with increasing tumour incidence,12 although such animal model systems are impoverished of variation compared with natural populations, their alleles being selected for survival in the homozygous state. Specific situations of homozygosity have also been directly associated with cancer, such as uniparental disomy through altered imprinting.13

Common regions of homozygosity (ROHs), the result of autozygosity, have recently been shown to occur at a high frequency in outbred populations as a result of selection.14 However, such selected regions are not likely to have been selected in relation to cancer, which is generally late onset relative to past human life expectancy. Therefore, searching for ROHs on a genome-wide basis provides a means of potentially exposing recessively functioning disease genes. Recently, Assié et al15 studied breast, prostate and head/neck cancer patients of Northern/Western European ancestry by whole-genome loss of heterozygosity analysis using a series of microsatellite markers. A significant increase in the frequency of homozygosity in cases compared with controls was reported. In a separate study of colorectal cancer using Affymetrix XbaI 50K SNP arrays, Bacolod et al16 showed that cases harboured significantly more homozygous regions than did healthy individuals. Collectively, these data provide support to the hypothesis that multiple, recessive, cancer-predisposing loci exist, which are not readily detected using a conventional GWA approach based on analysis of individual SNPs. A possible explanation for this is that relative risks per locus are too low and/or that the disease-associated variants are not in strong linkage disequilibrium (LD) with tag SNPs, perhaps because of low allele frequencies.

Although GWA studies have limited ability to identify recessive disease-causing alleles through single SNP analyses, these data sets can potentially be exploited to search for this class of susceptibility allele through whole-genome homozygosity analysis (WGHA). Hence, to examine whether homozygosity is associated with an increased risk of developing breast or prostate cancer and to search for novel recessively acting disease loci, we conducted a WGHA of breast and prostate cancer GWA study data generated by the Cancer Genetic Markers of Susceptibility (CGEMS) initiative.

Materials and methods

Data set

This study was based on GWA study data on breast and prostate cancer generated by the CGEMS, which are publicly accessible. Full details of the studies are provided in previously published material. Briefly, using Illumina Infinium HD Human550 Duo BeadChips (Illumina, San Diego, CA, USA), the GWA study of breast cancer involved genotyping 1183 breast cancer cases and 1185 controls.17 In the prostate GWA study, 1177 prostate cancer cases and 1149 controls were genotyped using Illumina Infinium Human217 and Human 317 BeadChips (Illumina).18 Of the 1177 prostate cancer cases analysed, 689 had aggressive disease, as defined by a Gleason score ≥7 and disease stage ≥III, and 488 prostate cancer cases had non-aggressive disease.

Quality control

We restricted our analysis to autosomal SNPs. We considered that a DNA sample had failed if it did not generate a genotype for >95% of loci. Similarly, an SNP was considered a failure if <90% of DNA samples generated a genotype at the locus. To identify samples showing relatedness, identity by state (IBS) values were calculated for pairs of individuals; for any pair with >80% identical SNP genotypes, we removed the sample with the lower call rate from the analysis. We excluded SNPs on the basis of deviation from Hardy–Weinberg equilibrium (HWE) using a threshold of P<1 × 10^−3 in either cases or controls. We also removed SNPs with MAF<0.01. To identify and exclude individuals with non-Western European ancestry, case and control data were merged with data from individuals of different ethnicities from the International HapMap Project. Genome-wide IBS distances were then calculated for markers shared between HapMap and our SNP panel, and the resulting dissimilarity measures were used to perform principal component analysis.
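The sample-level and SNP-level filters described above can be sketched as follows. This is a minimal illustration and not the pipeline actually used: the genotype coding (0, 1 or 2 copies of one allele, with None for a failed call), the function names, the order in which the filters are applied and the use of a 1-d.f. χ2 test for HWE are all assumptions made for the example. The relatedness and ancestry checks are omitted.

```python
import math

MISSING = None  # assumed encoding for a failed genotype call


def call_rate(calls):
    """Fraction of non-missing genotype calls in a sequence."""
    return sum(g is not MISSING for g in calls) / len(calls)


def minor_allele_frequency(calls):
    """MAF from genotypes coded as 0, 1 or 2 copies of one allele."""
    observed = [g for g in calls if g is not MISSING]
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1 - freq)


def hwe_chi2_p(calls):
    """P-value of a 1-d.f. chi-square test against Hardy-Weinberg proportions."""
    observed = [g for g in calls if g is not MISSING]
    n = len(observed)
    if n == 0:
        return 1.0  # nothing to test
    p = sum(observed) / (2 * n)
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic SNP, no departure measurable
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    counts = [observed.count(k) for k in (0, 1, 2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    return math.erfc(math.sqrt(chi2 / 2))  # upper tail, 1 d.f.


def apply_qc(genotypes, is_case):
    """Return indices of samples and SNPs passing the thresholds in the text.

    genotypes[i][j] is the call for sample i at SNP j.
    """
    # A sample passes only if >95% of loci were genotyped.
    keep_samples = [i for i, row in enumerate(genotypes) if call_rate(row) > 0.95]
    keep_snps = []
    for j in range(len(genotypes[0])):
        column = [genotypes[i][j] for i in keep_samples]
        if call_rate(column) < 0.90:  # SNP failure: <90% of samples called
            continue
        if minor_allele_frequency(column) < 0.01:
            continue
        cases = [genotypes[i][j] for i in keep_samples if is_case[i]]
        controls = [genotypes[i][j] for i in keep_samples if not is_case[i]]
        # Exclude on HWE deviation in either cases or controls.
        if hwe_chi2_p(cases) < 1e-3 or hwe_chi2_p(controls) < 1e-3:
            continue
        keep_snps.append(j)
    return keep_samples, keep_snps
```

In practice these steps are normally run with dedicated software on the full genotype files; the sketch only makes the direction of each threshold explicit.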

After imposing these stringent quality control measures for the breast cancer GWA study, 512 159 SNP genotypes were available on 1144 cases and 1141 controls. For the prostate cancer GWA study, 509 008 SNP genotypes were available on 1168 cases (685 with aggressive disease) and 1093 controls. These two data sets formed the basis of our WGHA.

Statistical and bioinformatics analysis

We detected ROHs using PLINK19 (v1.06) software (http://pngu.mgh.harvard.edu/~purcell/plink/contact.shtml#cite). The ROH tool moves a sliding window of SNPs across the entire genome. To allow for genotyping error or other sources of artificial heterozygosity, such as paralogous sequences, within a stretch of truly homozygous SNPs and, hence, to prevent underestimating the number and size of ROHs, 2% heterozygous SNPs were permitted in each window. We set the remaining options to default values (including allowing five missing calls per window, thereby ensuring >90% positive-predictive value of each ROH), except that we varied the parameters for homozyg-snp according to our heuristic preferences for defining ROH as detailed below. Subsequent statistical analyses were performed using packages available in R (version 2.7.0; R Foundation for Statistical Computing, Vienna, Austria) and specifically written Perl code. Comparison of the distribution of categorical variables was performed using the χ2-test. To compare the difference in the average number of ROHs between cases and controls, we used the Student t-test. Naive adjustment for multiple testing was based on the Bonferroni correction.

We used three metrics to investigate the selection pressure on each ROH. The integrated haplotype score (iHS) is based on LD surrounding a positively selected allele compared with background, providing evidence of recent positive selection at a locus.20 An iHS score >2.0 reflects the fact that haplotypes on the ancestral background are longer compared with those on the derived allelic background. Episodes of selection tend to skew SNP frequencies in different directions and Tajima's D is based on the frequencies of SNPs segregating in the region of interest.21 Fixation index (Fst) measures the degree of population differentiation at a locus, taking values from 0 to 1.0.22 iHS, D and Fst metrics were obtained from Haplotter Software (University of Chicago, Chicago, IL, USA; http://haplotter.uchicago.edu/selection/).20

Identification of runs of homozygosity

To examine commonly occurring ROH and to detect statistically significant associations, we only considered ROH that occurred in ≥10 individuals, thereby ensuring that our study had 80% or better power to identify statistically significant differences in ROH between cases and controls. The initial search for ROH was performed using PLINK,19 with a specified length of 80 consecutive SNPs. This ROH length was chosen to be more than an order of magnitude larger than the mean haploblock size in the human genome without being so large as to be very rare. The likelihood of observing 80 consecutive chance events can be calculated as follows.14 In the breast cancer series, mean heterozygosity in controls was calculated to be 34%. Thus, given 512 159 SNPs and 2285 individuals, a minimum length of 58 would be required to produce <5% randomly generated ROHs across all subjects ((1 − 0.34)^58 × 512 159 × 2285 = 0.040). A consequence of LD is that SNP genotypes are not always independent, thereby inflating the probability of chance occurrences of biologically meaningless ROHs. Analysis based on the pairwise LD SNP pruning function of PLINK, with a default value of r^2>0.8 being necessary to declare that one SNP tags another and restricting the search of tags to within 250 kb, showed 370 611 separable tag groups, representing a 27.6% reduction of information compared with the original number of SNPs. Thus, ROHs of length 80 were used to approximate the degrees of freedom of 58 independent SNP calls. Corresponding calculation in the prostate cancer series resulted in the same criterion being adopted.
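The arithmetic behind the 58-SNP and 80-SNP figures can be checked with a short script. This is a sketch under the assumptions stated in the text (independent SNP calls, a uniform heterozygosity of 34%); the function names are illustrative, and scaling the run length by the fraction of separable tag groups is our reading of how the LD adjustment was applied.

```python
def expected_chance_rohs(heterozygosity, run_length, n_snps, n_individuals):
    """Expected number of runs of `run_length` homozygous calls arising by chance,
    treating SNP calls as independent."""
    return (1 - heterozygosity) ** run_length * n_snps * n_individuals


def min_run_length(heterozygosity, n_snps, n_individuals, alpha=0.05):
    """Shortest run length for which the expected chance count falls below alpha."""
    length = 1
    while expected_chance_rohs(heterozygosity, length, n_snps, n_individuals) >= alpha:
        length += 1
    return length


N_SNPS, N_INDIVIDUALS, HET = 512159, 2285, 0.34  # breast cancer series
TAG_GROUPS = 370611                              # separable tag groups at r^2 > 0.8

independent_length = min_run_length(HET, N_SNPS, N_INDIVIDUALS)
chance = expected_chance_rohs(HET, independent_length, N_SNPS, N_INDIVIDUALS)
information_loss = 1 - TAG_GROUPS / N_SNPS
ld_adjusted_length = independent_length / (TAG_GROUPS / N_SNPS)

print(independent_length)            # minimum run of independent calls
print(round(chance, 3))              # expected chance ROHs at that length
print(round(information_loss, 3))    # reduction of information due to LD
print(round(ld_adjusted_length, 1))  # run length in genotyped SNPs
```

Dividing 58 by the retained fraction of information (about 0.724) gives a value close to 80, which is the run length adopted in the analysis.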

Once all ROHs of at least 80 SNPs in length were identified, these were pruned to only those ROHs that occurred in ≥10 individuals. To ensure that a minimum length and minimum number of SNPs in each ROH were maintained, each individual's SNP data were recoded as one if the SNP was in an ROH for that individual and zero otherwise. Thereafter, for each SNP, those SNPs with <10 individuals coded as one were recoded to zero before removing any ROHs that, because of this recoding, were now <80 SNPs in length. Therefore, this process resulted in a list of ‘common’ ROHs having a minimum of 80 consecutive ROH calls across ≥10 samples and with each ROH having identical start and end locations across all individuals in whom that ROH is observed.
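The recoding and pruning steps can be illustrated with a small sketch. It is not the Perl code used in the study: the data layout (one list of 0/1 flags per individual, ordered by SNP position) and the function names are assumptions, and the sketch follows the recoding steps literally without deriving the shared start and end locations of each common ROH.

```python
def runs_of_ones(flags):
    """Yield (start, end) half-open index pairs for each maximal run of 1s."""
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            yield start, i
            start = None
    if start is not None:
        yield start, len(flags)


def prune_common_rohs(roh_flags, min_individuals=10, min_snps=80):
    """Keep only ROH calls at SNPs shared by enough individuals, then drop
    any run that has become shorter than min_snps.

    roh_flags[i][j] is 1 if SNP j lies in an ROH in individual i, else 0.
    """
    n_snps = len(roh_flags[0])
    # Number of individuals in whom each SNP lies inside an ROH.
    carriers = [sum(row[j] for row in roh_flags) for j in range(n_snps)]
    pruned = []
    for row in roh_flags:
        # Recode to zero any SNP seen in an ROH in too few individuals.
        kept = [flag if carriers[j] >= min_individuals else 0
                for j, flag in enumerate(row)]
        # Remove runs that the recoding has shortened below the minimum.
        for start, end in list(runs_of_ones(kept)):
            if end - start < min_snps:
                kept[start:end] = [0] * (end - start)
        pruned.append(kept)
    return pruned
```

With toy thresholds such as min_individuals=2 and min_snps=3, a run carried by a single individual is removed entirely, whereas overlapping runs are trimmed to the SNPs they share.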

Results

Before conducting a WGHA of the two GWA data sets, we subjected cases and controls to rigorous quality control in terms of excluding samples and SNPs with poor call rates. Furthermore, we excluded SNPs showing significant departure from HWE. We then critically evaluated data sets for ancestral differences by principal component analysis. Figure 1 shows that all sample series were ancestrally comparable and were representative of European ancestry after excluding individuals of different ethnicity.

Figure 1

Comparison of ethnicity in each of the sample series. The first two principal components of the analysis were plotted. HapMap data are plotted in grey; CEU individuals as ‘•’; CHB+JPT individuals as ‘▴’; and YRI individuals as ‘+’. Plotted in black, as ‘X’, are (a) breast cancer cases, (b) prostate cancer cases, (c) breast cancer controls and (d) prostate cancer controls.

A total of 415 and 426 ROHs were identified in breast cancer and prostate cancer series, respectively (Supplementary Tables 1 and 2), encompassing ∼35% of the genome, as measured by both the total chromosomal length and the number of included SNPs in both series. Figure 2 shows the similarity between genome-wide plots of the location of each ROH among the genomes of both cancer cases and controls from the two studies.

Figure 2

Genome-wide plots showing the location of each run of homozygosity among the genomes in (a) breast cancer cases and (b) controls; (c) prostate cancer cases and (d) controls. The threshold limit was set to a minimum of at least 80 consecutive homozygous SNPs.

In both data sets, six ROHs exceeded 12 Mb in length and included ROHs encompassing the centromeric regions of chromosomes 3, 5, 6, 8, 11 and 16 (Supplementary Tables 1 and 2). Five other ROHs exceeded 12 Mb in one of the series and were all >8.7 Mb in the other series. Although the length of these large ROHs across centromeres is partly a consequence of extended regions that are not annotated by SNPs, this is, however, unlikely to be the sole explanation, as these centromeric regions are flanked by large homozygous regions. One of these centromeric regions (8p11.21–8q11.23) has been previously highlighted in several genome-wide studies of selective sweeps, thus validating our methodology.20, 23, 24, 25 Five non-centromeric regions harbouring ROHs >12 Mb in length in one of the series and >8.5 Mb in the other series were also identified: 3q25.31–26.1, 4q13.1–13.3, 6q13.5–15, 7q31.1–32.1 and 14q13.3–22.1 (Supplementary Tables 1 and 2).

In both series, the ROH covering the largest genomic region (28 Mb) spanned the centromere of chromosome 3 (3p12.3–3q13.11), a region previously shown to be characterised by a high frequency of ROH in the European population.24 This ROH was also defined by the largest number of SNPs in a single ROH in both studies.

There were 15 ROHs that had a frequency >25% in the breast cancer control series and 16 in the prostate cancer control series (Table 1). In all, 13 of the ROHs were common to both data sets. Of those common to both data sets, five have previously been reported to have high frequency and to harbour several gene categories that have been identified to be influenced by a high degree of selective pressure.20, 23, 24, 25 Publicly available data from HapMap do not indicate that these regions have excessive copy number variation or segmental duplication, nor do they have very low recombination rates.23 However, the high iHS, D and Fst metrics for each region are compatible with positive selection in the Caucasian samples on which the two case series are based (Table 1).

Table 1 List of ROHs with frequency of >25% in either of the two control series

The total number of common ROHs observed in each individual was calculated to permit genome-wide comparison between case and control groups in each of the two data sets. Therefore, each individual was assigned a value between 0 and 415 in the breast cancer series and a value between 0 and 426 in the prostate cancer series. Overall, patients with breast cancer (mean=21.10, SD=4.75) and controls (mean=20.97, SD=4.75) showed no significant difference in the average number of ROHs (t(2277)=0.6135, P=0.54). Similarly, for prostate cancer, overall statistics showed no significant difference (t(2255)=0.5434, P=0.59) in the average number of ROHs between cases (mean=21.07, SD=4.79) and controls (mean=20.96, SD=4.71).
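The genome-wide comparison amounts to counting the common ROHs carried by each individual and comparing group means. A minimal sketch of that comparison is given below, assuming a pooled-variance Student t-test and a 0/1 carrier matrix per group; the function names and data layout are illustrative and the analysis itself was run in R.

```python
import math
import statistics


def roh_counts(carrier_rows):
    """Number of common ROHs carried by each individual.

    carrier_rows[i][k] is 1 if individual i carries common ROH k, else 0.
    """
    return [sum(row) for row in carrier_rows]


def student_t(sample_a, sample_b):
    """Two-sample Student t statistic with pooled variance, and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(sample_a)
                  + (n_b - 1) * statistics.variance(sample_b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error
    return t, df
```

For example, `student_t(roh_counts(case_rows), roh_counts(control_rows))` would return the statistic and degrees of freedom for one series, where `case_rows` and `control_rows` are hypothetical carrier matrices; the P-value then follows from the t distribution with that many degrees of freedom.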

In addition, to examine whether there were differences in the distributions of ROH in the genomes of cases and controls, we computed cumulative distributions for both series (Figure 3). This analysis also provided no support for a difference in autozygosity profiles between cases and controls in either data set on a genome-wide basis. Furthermore, cases and controls from both series were similarly directly comparable.

Figure 3

Cumulative distributions of ROH in breast cancer and prostate cancer series. The graph is presented in such a way that each data point represents the cumulative fraction (y axis) of samples with the corresponding minimum cumulative run of homozygosity (x axis).

At an individual level, six ROHs differed significantly (P<0.01) between breast cancer cases and controls (Table 2). Five of these six ROHs were more common in cases than in controls. ROH286 was identified in 1.4% of cases (n=16) compared with 0.2% of controls (n=2) (P=0.0009). ROH202 mapping to 6q22.31–22.3 was identified in 4.3% (n=49) of cases compared with 2.1% of controls (n=24). Intriguingly, this region of association to which ECHDC1 (enoyl coenzyme A hydratase domain-containing protein 1; MIM 612136) and RNF146 (ring-finger protein 146; MIM 612137) map has previously been linked to breast cancer susceptibility in a GWA study of Ashkenazi Jews.26 After adjusting for multiple testing, however, these associations, as well as others, were not statistically significant.
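The effect of the multiple-testing adjustment on these single-ROH comparisons can be illustrated with the counts quoted above. The sketch assumes a Pearson χ2 test on the 2 × 2 table without continuity correction and a Bonferroni factor equal to the 415 ROHs examined in the breast cancer series; the text does not state which variant of the test or which correction factor was used, so both are assumptions.

```python
import math


def chi2_2x2(case_hits, n_cases, control_hits, n_controls):
    """Pearson chi-square statistic and P-value (1 d.f., no continuity correction)
    for carrier counts in cases versus controls."""
    a, b = case_hits, n_cases - case_hits
    c, d = control_hits, n_controls - control_hits
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 d.f.
    return chi2, p_value


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted P-value, capped at 1."""
    return min(1.0, p_value * n_tests)


N_CASES, N_CONTROLS = 1144, 1141  # breast cancer series after quality control
N_TESTS = 415                     # assumed: one test per common ROH

for name, case_hits, control_hits in [("ROH286", 16, 2), ("ROH202", 49, 24)]:
    chi2, p = chi2_2x2(case_hits, N_CASES, control_hits, N_CONTROLS)
    print(name, round(chi2, 2), round(p, 4), round(bonferroni(p, N_TESTS), 2))
```

Under these assumptions the unadjusted P-value for ROH286 is a little below 0.001, in line with the reported value, and no longer falls below 0.05 once multiplied by the number of ROHs tested.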

Table 2 List of ROHs significantly associated with (a) breast cancer and (b) prostate cancer risk (P<0.01).

Only one of the four ROHs that differed significantly (P<0.01) between prostate cancer cases and controls had a higher prevalence in cases (ROH86), occurring in 19.2% (n=224) of cases and in 14.7% (n=161) of controls (P=0.0049; Table 2). Interestingly, PCGEM1 (prostate-specific gene 1; MIM 605443) maps to the centromeric region of ROH86. PCGEM1 is a prostate tissue-specific and prostate cancer-associated non-coding RNA (ncRNA) gene that has a role in the regulation of apoptosis.27, 28 PCGEM1 displays androgen-dependent expression and is overexpressed in ∼80% of prostate cancer, suggesting specific functions of PCGEM1 in the biology and tumourigenesis of the prostate gland.27, 28 ROH149, which maps to 4q28.2–28.3, was identified in 7.0% (n=82) of cases compared with 10.1% (n=110) of controls (P=0.0095). An association between 4q28.2 defined by microsatellite D4S2394 and prostate cancer was identified by Assié et al15 (P<0.001). The clinical behaviour of prostate cancer is heterogeneous and is likely to reflect differences in tumour biology and possibly molecular aetiology. In this regard, we conducted a subgroup analysis of prostate cancer cases stratified by aggressive phenotype. However, this analysis did not provide any further support for a relationship between ROH and prostate cancer risk (data not shown).

Discussion

Two recent studies of cancer have reported that signatures of autozygosity correlate with tumour incidence, and it has been proposed that these regions showing identity by descent (IBD) may be the locations of genes contributing to tumour heritability.15, 16 Moreover, these data have been interpreted as providing an explanation for the increased risk of various cancers that has often been reported in inbred groups.

In this study, we have used high-density genomic scan data to compare the structure of genetic variation in patients with breast and prostate cancer with that in healthy controls. By imposing stringent quality control, we have ensured that individuals in our study were from an apparently panmictic population with no evidence of stratification. Our data provide further evidence that ROHs, ranging in size from 1 to 28 Mb, are common in individuals from an outbred population.29, 30, 31, 32 As documented in Table 1, the common ROHs that we have identified in controls are representative of autozygosity due to distant consanguinity rather than due to chromosomal abnormalities or common copy number variants. Moreover, these homozygous regions are too common and too small to be caused by recent consanguinity; instead, they are consistent with regions under selective pressure.33 On the basis of our analysis, there was, however, no evidence of an association between homozygosity and risk of breast or prostate cancer, on the basis of either total ROH size per individual or specific ROH.

The assertion that increased autozygosity correlates with cancer incidence provides an attractive explanation for reported increased cancer risk in inbred populations. However, as recently articulated, several criticisms can be levelled at this assertion. The observation of an increased cancer risk associated with consanguinity has often been based on studies of a small number of individuals in an isolated community or a single large family with a high level of inbreeding.34 Thus, the relevance of inbreeding to the population risk of cancer is unclear, as inbreeding and founder effects may be confounded. Sample sizes in molecular studies15, 16 that have sought to establish a relationship between ROH and cancer risk have generally been small and, crucially, case and control groups have been ethnically heterogeneous or unmatched. Furthermore, the study of breast, prostate and head/neck cancer reported by Assié et al made use of relatively sparse microsatellite data. Here, we have addressed these shortcomings by analysing a large set of breast and prostate cancer cases and controls genotyped for several hundred thousand SNPs and by imposing a high level of quality control in terms of both genotyping and sample ancestry.

Although we do not advocate the use of uncorrected P-values, it is intriguing that our analysis identified two regions previously implicated in susceptibility to these cancers. Although this suggests that further interrogation of other regions may prove profitable, there is limited support from our analysis that WGHA of outbred populations provides a robust methodology for identifying novel cancer susceptibility loci.

In conclusion, our findings do not provide evidence that levels of measured homozygosity, whatever their aetiology, from autozygosity, uniparental isodisomy or hemizygosity, confer an increased risk of developing either breast or prostate cancer in a predominantly outbred population. Moreover, it is unlikely that large numbers of recessive alleles exist that predispose to breast or prostate cancer and are unmasked by autozygosity in most European populations. This analysis does not exclude the possibility that recessively functioning disease alleles exist for these cancers, whatever their penetrances, or that autozygosity may operate in populations characterised by a high level of inbreeding.