The current excitement about copy-number variation: how it relates to gene duplications and protein families

https://doi.org/10.1016/j.sbi.2008.02.005

Following recent technological advances there has been an increasing interest in genome structural variants (SVs), in particular copy-number variants (CNVs) – large-scale duplications and deletions. Although not immediately evident, CNV surveys make a conceptual connection between the fields of population genetics and protein families, in particular with regard to the stability and expandability of families. The mechanisms giving rise to CNVs can be considered as fundamental processes underlying gene duplication and loss; duplicated genes being the results of ‘successful’ copies, fixed and maintained in the population. Conversely, many ‘unsuccessful’ duplicates remain in the genome as pseudogenes. Here, we survey studies on CNVs, highlighting issues related to protein families. In particular, CNVs tend to affect specific gene functional categories, such as those associated with environmental response, and are depleted in genes related to basic cellular processes. Furthermore, CNVs occur more often at the periphery of the protein interaction network. In comparison, protein families associated with successful and unsuccessful duplicates are associated with similar functional categories but are differentially placed in the interaction network. These trends are likely reflective of CNV formation biases and natural selection, both of which differentially influence distinct protein families.

Introduction

Gene duplication is a major process leading to novel genes and proteins, and it may naively be assumed to be relatively slow in evolutionary terms. However, recent results from the field of genetics argue that gene duplication has occurred frequently during the recent history of the human population and that gene duplicates occur in humans in variable numbers and may be constantly generated de novo. Studies measuring human genome variation are receiving much attention currently [1], as novel genomics approaches have revealed an unanticipated level of genetic variation in the human population (e.g. [2, 3, 4••, 5, 6, 7]). A type of variation that was recently found to be abundant in the human genome is genome structural variation [4••, 5, 7, 8, 9, 10, 11, 12, 13, 16]. Genome structural variants are generally defined (e.g. [14, 15]) as kilobase- to megabase-sized deletions, insertions, duplications, and inversions. Furthermore, structural variants cause more sequence differences between humans than the widely studied single nucleotide polymorphisms (SNPs) [4••, 5, 6, 7, 14, 15], if one considers the total number of nucleotides spanned by both forms of variation. Even though genome structural variants can alter the intron/exon structure of genes by disrupting exons or fusing genes together [5], they frequently span entire genes, leading to different gene copy-numbers between individuals. Following the first genome-wide mapping in humans [8, 16], genome structural variants have been identified in several mammalian genomes (e.g. [17, 18, 19, 20, 21•, 22]) and at varying levels of resolution (see Box 1). Here, we use the term copy-number variant (CNV) to refer to a genome structural variant leading to changes in gene copy number (rather than an inversion or a variant not encompassing genes, both of which may also influence protein function, for example by influencing gene regulation).
While there have been early insights on the origin of small indels (≪1 kb) (e.g., [23, 24]), we are just now beginning to understand which mechanisms are commonly behind the formation of CNVs. Recent advances in understanding these formation-mechanisms [5, 25] have been fueled by the development of approaches for mapping genome structural variation at the resolution of base-pairs (Box 1).

CNVs are of significance in relation to the human proteome in various ways. First, copy-numbers of protein-coding genes can be strikingly different between apparently healthy (‘normal’) individuals (e.g. [4••, 28, 29••]), with instances of up to 10 additional copies reported for several protein-coding gene loci (e.g. [28, 29••, 30, 31]). In line with this, de novo CNV formation is thought to be constantly ongoing in mammalian genomes [21•, 26, 27], with obvious implications for protein evolution; in fact, CNV genesis may occur at rates higher than those of point mutations that affect gene function. Finally, results from several studies point to a tight relationship of gene copy-number with messenger RNA and protein expression level (e.g. [29••, 32]). Variation at the level of gene expression may represent the underlying basis for several phenotypic traits associated with CNVs, such as dietary preferences across distinct populations [29••] or the susceptibility to diseases including HIV [28], breast cancer [33], autism [26], and several auto-immune diseases [30, 31, 34]. Furthermore, through this ‘gene-dosage’ effect, CNVs may influence protein complex formation and tightly regulated cellular systems. Since some of these require their individual components to be expressed at stoichiometrically precise levels, a CNV may have potentially harmful (or beneficial) effects. Many additional phenotypic relationships are likely to be discovered in the near future with the ongoing application and improvement of approaches for ascertaining CNVs (Box 1) and for associating CNVs and phenotypes [35, 36, 37]. Thus, CNVs are not only relevant to population genetics, but should also be considered in systems biology and proteomics studies. CNVs may constitute a source of redundancy and thus evolvability or robustness, that is, provide ‘replacement proteins’. Often, CNVs will behave in a selectively neutral manner (similar to most SNPs).
Nevertheless, they represent a genomic pool of evolving transcripts, genes, and proteins that in longer evolutionary terms may become fixed in the population as novel genes. Here, we summarize recent findings in the field of genome structural variation and discuss implications for the systems biology and proteomics fields.

Knowledge of CNVs has dramatically increased following recent technological advances (see Box 1). For instance, a CNV map generated from data of over two hundred individuals has revealed that 12% or more of the human genome may be prone to copy-number variation [4••]. Recent studies at considerably higher resolution, sufficient to map small CNVs (<50 kb) and to identify the precise boundaries (or breakpoints) of CNVs, have revealed that the number of genome structural variants (>1 kb) that distinguish genomes of different individuals is at least on the order of 600–900 per individual [5, 6]. Of these, approximately 150 genome structural variants per individual presumably directly affect protein-coding genes by intersecting with them [5]. Moreover, recent surveys have led to a re-estimation of the total amount of sequence divergence between individuals; while it was initially assumed that the genomes of two unrelated individuals differ by ∼0.1% (mainly because of SNPs), it has recently been estimated that at least 0.5% of our genomes differ [6], with the majority of variation being due to CNVs.

Recent findings concerning the abundance of CNVs in the human genome add to current perspectives on gene duplication and loss – essential processes in genome and proteome evolution. For nearly a hundred years, duplication of genetic material has been regarded as an important factor in the evolution of higher organisms (see [38] and references therein) – and protein birth by duplication is widely considered to be more common than formation of proteins ‘from scratch’ [39]. Following gene duplication, one of the newly generated paralogs may escape selective constraints (purifying selection) and become free to acquire a new function (neo-functionalization). Furthermore, both paralogous sequences may experience decreased selective pressure after duplication, which may reflect partitioning between paralogs into different functions which had been combined in the multifunctional ancestral gene [40, 41] (sub-functionalization). Gene duplication is also thought to be a major contributor to the evolution of protein networks [42], even though it may not account for the evolution of complex molecular machines [43]. Duplications may evolve in an effectively neutral fashion over extended evolutionary time scales [41]. They further may be advantageous to the cell by increasing the robustness against mutations (e.g. [37]). Moreover, at short evolutionary time scales the potential to modify gene/protein expression levels through gene dosage change may promote gene duplications and losses. In this regard, a genome-wide study [32] has recently reported relationships between CNVs and mRNA levels. Furthermore, Perry et al. [29••] found that increased copy-numbers of the amylase gene reflect higher levels of protein expression and are correlated with dietary preferences for starch. 
Note that a single CNV formation event – a type of mutation that for some genomic loci appears to occur more frequently than nucleotide substitutions (see below) – may be sufficient to specifically promote gene expression modification; thus gene copy-number changes may facilitate evolutionary adaptation involving protein abundance change. Nevertheless, nucleotide substitutions that affect the regulation of gene expression are likely to eventually supersede gene copy-number increase (or decrease), that is, take over in the long run; in particular, maintaining a large number of identical genes per genome over longer evolutionary time scales likely incurs significantly increased ‘costs’ related to genome stability and repair.

The abundance of CNVs in the genome indicates that gene duplication (and loss) probably occurs at a constant and high rate in humans. For a number of loci in the genome involved in commonly recurring genomic disorders – regions in which CNVs may recur frequently – this rate has recently been estimated to be 1e-4 to 1e-6 per generation [44], which is considerably higher than the rate at which point mutations are thought to occur (2e-8; see refs. in [44]). Furthermore, in a recent analysis involving inbred mice, CNV formation rates as high as 1e-2 to 1e-3 have been inferred for loci encoding genes [21]. Note that in order to properly compare these rates we have to take into account the fact that the rate at which CNVs arise has been determined for large loci and large CNVs, for example of 100 kb in size, whereas the point mutation rate is given per nucleotide. If we consider that ∼1% of the genome comprises coding sequence, then the rate at which protein coding sequence will experience a new point mutation within a given 100 kb locus is approximately 2e-8 * 1e5 * 0.01 = 2e-5. Conversely, any given novel CNV of 100 kb would affect protein-coding sequence in the given locus. Thus, for several gene loci, CNVs formed de novo may be significantly more likely to affect coding sequence than point mutations. Frequently, point mutations will remain silent (e.g. if they fall into synonymous sites) and may have little or no effect on protein function. On the other hand, protein duplicates may not always be expressed, and expression differences may sometimes have little or no functional consequence.
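The rate comparison above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not a population-genetic model: it assumes, as the text does, that coding sequence makes up a uniform ∼1% of a 100 kb locus and that any CNV spanning the locus affects its coding sequence. The function and variable names are illustrative and do not come from the cited studies.

```python
# Per-generation rates quoted in the text; the uniform coding density
# and the dictionary labels are simplifying assumptions for illustration.
POINT_MUTATION_RATE_PER_BP = 2e-8   # point mutations per nucleotide
LOCUS_SIZE_BP = 100_000             # example locus / CNV size (100 kb)
CODING_FRACTION = 0.01              # ~1% of the genome is coding sequence

CNV_RATE_RANGES = {
    "human recurrent-disorder loci": (1e-6, 1e-4),
    "inbred mouse gene loci": (1e-3, 1e-2),
}


def coding_point_mutation_rate(per_bp_rate, locus_size_bp, coding_fraction):
    """Rate at which coding sequence in the locus gains a new point mutation."""
    return per_bp_rate * locus_size_bp * coding_fraction


def fold_difference(cnv_rate, locus_point_rate):
    """How many times more (or less) frequent a CNV is than a coding point mutation."""
    return cnv_rate / locus_point_rate


locus_point_rate = coding_point_mutation_rate(
    POINT_MUTATION_RATE_PER_BP, LOCUS_SIZE_BP, CODING_FRACTION
)
print(f"coding point-mutation rate per locus: {locus_point_rate:.1e}")

for label, (low, high) in CNV_RATE_RANGES.items():
    print(
        f"{label}: CNV rate is "
        f"{fold_difference(low, locus_point_rate):g}x to "
        f"{fold_difference(high, locus_point_rate):g}x the coding point-mutation rate"
    )
```

Under these assumptions the human estimates span roughly 0.05-fold to 5-fold of the per-locus coding point-mutation rate, and the mouse estimates roughly 50-fold to 500-fold. This is why the comparison holds for several loci rather than for all of them.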

It is evident from genome-wide surveys that CNVs exhibit a highly non-uniform distribution along chromosomes. This distribution may have different causes. First, it may be owing to biases in the ascertainment of CNVs. Second, locus-specific differences in the rate at which CNVs are formed may cause this disparity. Finally, the distribution may be owing to natural selection acting differentially throughout the genome, that is, depending on the phenotypic changes caused by CNVs in different genomic regions.

We believe that the fact that several complementary technologies have detected CNVs at overlapping genomic loci (which becomes quickly apparent when browsing the Database of Genomic Variants (DGV) [16]) indicates that technological biases are unlikely to be responsible for the trend.

However, discerning the remaining two potential causes is not straightforward. Mutation, population-variation and fixation by natural selection or random drift have been studied extensively in relation to SNPs, but much less so for CNVs. The existence of genomic loci undergoing recurrent de novo structural rearrangements in relation to disease [44] suggests that genomic CNV formation biases exist. In this regard, for instance, subtelomeric regions represent hot spots for interchromosomal recombination [45] and segmental duplication of genomic sequence [45, 46]. In line with this, results from Redon et al. [4••] indicate an enrichment of CNVs in subtelomeric regions (within 500 kb of the ends of chromosome arms). Consequently, breakage or fusion of chromosomes during the evolution of mammalian genomes may have influenced the rate of duplication (and loss) of gene families across species.

Natural selection can be analyzed by studying the overlap of CNVs with various functional elements. For instance, recent studies have revealed that protein-coding genes, and also other genomic elements including highly conserved non-coding regions, tend to be depleted among CNVs, indicating purifying selection [4••, 5, 47]. In particular, deletions appear to be under stronger selection than duplications [4••]. Furthermore, certain functional categories of protein-coding genes are more prone to be affected by CNVs than others. For instance, Table 1 shows a strong enrichment among CNVs for several protein domains. Our survey presented in Figure 1 extends this analysis by assessing which protein functional categories are most strikingly enriched or depleted amongst CNVs: consistent with earlier surveys we find that proteins involved in processes related to environmental response tend to be enriched in CNVs [4••, 5, 8, 9, 14, 48, 49] and duplicated genes retained in the genome [50], whereas proteins involved in fundamental cellular functions, such as cellular physiological processes, tend to be depleted. While the latter trend is presumably due to purifying selection owing to constraints, some of the former enrichment may be owing to positive selection. Such effects should be observable also in fixed variants. Hence, we extended our survey by comparing CNVs to ‘successful duplicates’ (i.e. recent segmental duplications), and ‘unsuccessful duplicates’ (nonprocessed pseudogenes, i.e. duplicated genes that were recently inactivated by mutation; e.g. [51, 52]). The distributions for successful duplicates reveal trends similar to the ones observed for CNVs and to some degree for duplicated pseudogenes (Figure 1b). That is, we observe enrichments for environmental response categories, consistent with earlier surveys (e.g. [51]). 
Our results are consistent with constraint (purifying selection) acting on dosage sensitive genes, leading to the removal of extra gene copies that cause dosage imbalance.

Additionally, our survey shows that unsuccessful duplicates tend to be longer than successful duplicates (Table 2), both at the gene and at the protein level. Although this trend may partially be influenced by the way successful and unsuccessful duplicates have been ascertained, the observations are in line with previous findings that complex genes, such as alternatively spliced ones that are on average longer than non-alternatively spliced genes [53], tend to be less prone to duplication than genes with few exons and no or few additional splice forms [54, 55, 56].

Selection rarely acts on functions carried out by a protein ‘in isolation’. Most proteins, rather than working as a single entity, act in concert as members of a tightly regulated pathway or as part of a large multi-protein complex. Consequently, the degree to which proteins tend to be affected by CNVs is partially reflected in the protein's role in the protein interaction network, that is, the entirety of proteins thought to interact in the cell. Recently, it was shown that CNVs are more likely to affect proteins at the periphery of the network (with few interaction partners), whereas proteins at the network center (many interaction partners) are less likely to be variable in copy number [57•, 58•]. These observations are consistent with an over-representation of small protein families (having few or no paralogs) in the center of protein networks [59] and the observation that members of large protein families tend not to be involved in protein complexes [60]. It is plausible that proteins at the network periphery are under less evolutionary constraint and are thus freer to evolve. In contrast, duplicates affecting the network center may be detrimental and thus more likely to be selectively removed. The latter is strongly supported by the fact that unsuccessful gene duplicates are observed at the network center at a significantly higher frequency than successful duplicates (Figure 2).

Besides purifying selection, positive or directional selection has been implicated in influencing the distribution of CNVs and successful duplicates in the human genome. For instance, genes frequently affected by CNVs were reported to exhibit elevated rates of amino acid change in evolution [48], which may be an indicator of positive selection. Moreover, a recent case study focusing on the salivary amylase protein Amy1 has concluded that AMY1 gene copy number in human populations has likely been shaped by diet-related positive selection pressures [29••]. Furthermore, CNVs are, similar to positively selected nucleotide changes, biased to the protein interaction network periphery [57]; thus, adaptive evolution – involving SNPs or CNVs – may act at the periphery of the network rather than the center. Concerning successful gene duplicates, several groups have reported signs of positive selection (at the level of amino acid replacements) for recently generated gene duplicates in primates (see e.g. [61, 62]) and rodents [63]. Finally, a recent computational analysis has presented evidence for substantial positive selection in hotspots of recently formed segmental duplications in humans [64]; these hotspots are presumably subject to recurrent de novo gene duplication.

At least for some genes it appears that gene copy-number may evolve in a neutral fashion: for instance, Nozawa et al. [56] reported that no significant difference exists in the amount of CNVs between functional and nonfunctional (i.e. pseudogenic) sensory receptor genes, a gene family particularly prone to structural variation (e.g. [4••, 5]). On the other hand, the positive effect of gene duplication or loss in the case of CNVs spanning more than one gene may in some instances balance or overshadow the potentially negative impact of protein dosage imbalances and may drive the fixation of CNVs in particular regions of the genome.

Nevertheless, negative effects of commonly occurring CNVs are also visible in current CNV datasets (Figure 3). In particular, a survey in which we linked protein domains present in CNVs to the Online Mendelian Inheritance in Man (OMIM) and the Cancer Gene Census (CGC) [66] databases revealed an enrichment of copy-number variable genes amongst disease-related genes; indeed, potentially positive effects of CNVs need to be balanced against harmful influences of genome structural variation. Improved techniques for extensively mapping structural variation in the genome (Figure 4) will soon enable studying associations of CNVs and diseases comprehensively at high resolution.

Conclusions

CNVs should be considered in systems biology and proteome evolution-related studies owing to their effect on protein expression, function and the phenotype, and their likely contribution to protein family evolution. After formation and subsequent fixation following selection or random drift, CNVs may give rise to gene duplicates or losses; thus they represent important genomic intermediates in genome and proteome evolution.

References and recommended reading

Papers of particular interest, published within the period of review, have been highlighted as:

  • • of special interest

  • •• of outstanding interest

Acknowledgements

Funding was provided by a Marie Curie Fellowship (J.O.K.) and the NIH (Yale Center of Excellence in Genomic Science grant). The authors thank Pedro Alves and Jeroen Raes for valuable comments on the manuscript.

References (75)

  • K.K. Wong et al. A comprehensive analysis of common copy-number variations in the human genome. Am J Hum Genet (2007)
  • E. Tuzun et al. Fine-scale structural variation of the human genome. Nat Genet (2005)
  • D.F. Conrad et al. A high-resolution survey of deletion polymorphism in the human genome. Nat Genet (2006)
  • S.A. McCarroll et al. Common deletion polymorphisms in the human genome. Nat Genet (2006)
  • D.A. Hinds et al. Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat Genet (2006)
  • A.J. Sharp et al. Segmental duplications and copy-number variation in the human genome. Am J Hum Genet (2005)
  • J.L. Freeman et al. Copy number variation: new insights in genome diversity. Genome Res (2006)
  • A.J. Iafrate et al. Detection of large-scale variation in the human genome. Nat Genet (2004)
  • J. Li et al. Genomic segmental polymorphisms in inbred mouse strains. Nat Genet (2004)
  • T.L. Newman et al. A genome-wide survey of structural variation between human and chimpanzee. Genome Res (2005)
  • L. Feuk et al. Discovery of human inversion polymorphisms by comparative analysis of human and chimpanzee DNA sequence assemblies. PLoS Genet (2005)
  • G.H. Perry et al. Hotspots for copy number variation in chimpanzees and humans. Proc Natl Acad Sci U S A (2006)
  • C.M. Egan et al. Recurrent DNA copy number variation in the laboratory mouse. Nat Genet (2007)
  • A.S. Lee et al. Analysis of copy number variation in the rhesus macaque genome identifies candidate loci for evolutionary and human disease studies. Hum Mol Genet (2008)
  • G. Levinson et al. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol Biol Evol (1987)
  • P.W. Messer et al. The majority of recent short DNA insertions in the human genome are tandem duplications. Mol Biol Evol (2007)
  • G.H. Perry et al. The fine-scale and complex architecture of human copy-number variation. Am J Hum Genet (2008)
  • J. Sebat et al. Strong association of de novo copy number mutations with autism. Science (2007)
  • E. Gonzalez et al. The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science (2005)
  • G.H. Perry et al. Diet and the evolution of human amylase gene copy number variation. Nat Genet (2007)
  • M. Fanciulli et al. FCGR3B copy number variation is associated with susceptibility to systemic, but not organ-specific, autoimmunity. Nat Genet (2007)
  • E.J. Hollox et al. Psoriasis is associated with increased beta-defensin genomic copy number. Nat Genet (2008)
  • B.E. Stranger et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science (2007)
  • B. Frank et al. Copy number variant in the candidate tumor suppressor gene MTUS1 and familial breast cancer risk. Carcinogenesis (2007)
  • T.J. Aitman et al. Copy number polymorphism in Fcgr3 predisposes to glomerulonephritis in rats and humans. Nature (2006)
  • S.A. McCarroll et al. Copy-number variation and association studies of human disease. Nat Genet (2007)
  • J.S. Beckmann et al. Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nat Rev Genet (2007)
    a

    These authors contributed equally.
