Main

When the inherent redundancy of the genetic code was discovered, scientists were rightly puzzled by the role of synonymous mutations1. The central dogma of molecular biology suggests that synonymous mutations — those that do not alter the encoded amino acid — will have no effect on the resulting protein sequence and, therefore, no effect on cellular function, organismal fitness or evolution. Nonetheless, in most sequenced genomes, synonymous codons are not used in equal frequencies. This phenomenon, termed codon-usage bias (Fig. 1), is now recognized as crucial in shaping gene expression and cellular function through its effects on diverse processes, ranging from RNA processing to protein translation and protein folding. Naturally occurring codon biases are pervasive, and they can be extremely strong. Some species such as Thermus thermophilus avoid certain codons almost entirely. Synonymous mutations are important in applied settings as well — the use of particular codons can increase the expression of a transgene by more than 1,000-fold2.

Figure 1: Codon bias within and between genomes.

The relative synonymous codon usage (RSCU)127 is plotted for 50 randomly selected genes from each of nine species. RSCU ranges from 0 (when the codon is absent), through 1 (when there is no bias) to 6 (when a single codon is used in a six-codon family). Methionine, tryptophan and stop codons are omitted. Genes are in rows and codons are in columns, with C- and G-ending codons on the left side of each panel. Note the extensive heterogeneity of codon usage among human genes. Other measures of a gene's codon bias include the codon adaptation index (CAI; the similarity of codon usage to a reference set of highly expressed genes)35, the frequency of 'optimal' codons (FOP)28 and the tRNA adaptation index (tAI; the similarity of codon usage to the relative copy numbers of tRNA genes)128.
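The RSCU values described in this caption can be computed from codon counts as the observed count of a codon divided by the count expected if all codons in its synonymous family were used equally. The following is a minimal sketch under that definition, assuming the standard genetic code; the function name `rscu` and its handling of absent amino acids are illustrative choices, not taken from the cited work.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(seq):
    """Return {codon: RSCU} for an in-frame coding sequence.

    Methionine, tryptophan and stop codons are omitted, as in the figure.
    Amino acids that do not occur in the sequence are skipped because
    their RSCU is undefined.
    """
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group codons into synonymous families.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa not in "MW*":
            families[aa].append(codon)

    values = {}
    for fam in families.values():
        total = sum(counts[c] for c in fam)
        if total == 0:
            continue
        for c in fam:
            # observed count / (family total / family size)
            values[c] = counts[c] * len(fam) / total
    return values
```

For example, a sequence that encodes leucine only with CTG gives `rscu("CTGCTGCTGCTG")["CTG"] == 6.0`, the upper limit for a six-codon family, whereas a sequence that uses the four alanine codons once each gives 1.0 for every alanine codon.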

We already enjoy a broad array of often conflicting hypotheses for the mechanisms that induce codon-usage biases in nature, and for their effects on protein synthesis and cellular fitness. Until recently, we have been unable to systematically interrogate these hypotheses through large-scale experimentation. As a result, despite decades of interest and substantial progress in understanding codon-usage biases, there is an overabundance of plausible explanatory models whose relative, quantitative contributions are seldom compared.

Advances in synthetic biology, mass spectrometry and sequencing now provide tools for systematically elucidating the molecular and cellular consequences of synonymous nucleotide variation. Such studies have refined our understanding of the relative roles of initiation, elongation, degradation and misfolding in determining protein expression levels of individual genes and the overall fitness of a cell. This information, in turn, is helping researchers to distinguish among the forces that shape naturally occurring patterns of codon usage. Researchers can also leverage high-throughput studies in applied settings that require controlled, heterologous gene expression, for example, to improve design principles for vaccine development and gene therapy.

Here we review the causes, consequences, and practical use of codon-usage biases. Because we already benefit from several outstanding reviews on naturally occurring codon biases3,4,5,6,7, we focus here on those classical hypotheses that remain unresolved and the recent developments arising from high-throughput studies. We begin by summarizing the empirical patterns of codon usage that are observed across species, across genomes and across individual genes. We describe the diverse array of mechanistic hypotheses for the causes of such variation and the sequence signatures that support them. Against this backdrop of hypotheses and sequence analysis, we describe experimental work that relates codon usage to endogenous gene expression and cellular fitness. From this, we turn to experimental studies on heterologous gene expression and their implications both for understanding natural synonymous variation and for engineering new constructs in applied settings.

Mechanistic hypotheses

Significant deviations from uniform codon choice have been observed in species from all taxa, including bacteria, archaea, yeast, fruitflies, worms and mammals. The overall codon usage in a genome can differ dramatically between species, although seldom between closely related species6.

Mutational versus selective hypotheses. Explanations for patterns of codon usage, within or between species, fall into two distinct categories that are associated with two independent forces in molecular evolution: mutation and natural selection3,4,5.

A mutational explanation posits that codon bias arises from the properties of underlying mutational processes — for example, biases in nucleotides that are produced by point mutations8, contextual biases in the point mutation rates or biases in repair. Mutational explanations are neutral because they do not require any fitness advantage or detriment to be associated with alternative synonymous codons. Mutational mechanisms are typically invoked to explain interspecific variation in codon usage, especially among unicellular organisms.

Explanations involving natural selection posit that synonymous mutations somehow influence the fitness of an organism, and they can therefore be promoted or repressed throughout evolution. Selective mechanisms are typically invoked to explain variation in codon usage across a genome or across a gene, although some interspecific variation is also attributable to such mechanisms (see below).

Selective and neutral explanations for codon usage are not mutually exclusive, and both types of mechanisms surely have a role in patterning synonymous variation within and between genomes3,5,9. Below we discuss the patterns of codon usage that have been documented at various levels of biological organization in light of their mutational or selective causes.

Patterns of codon usage

Patterns across species. The strongest single determinant of codon-usage variation across species is genomic GC content. In fact, differences in codon usage between bacterial species can be accurately predicted from the nucleotide content in their non-coding regions3,10. Genomic GC content is itself typically determined by mutational processes that act across the whole genome. As a result, most interspecific variation in codon usage is attributed to mutational mechanisms3,10, although the molecular causes of mutation biases are largely unknown10. Contrary to early expectations, the GC content of bacterial genomes or protein-coding genes is not correlated with optimal growth temperature (although, interestingly, structural RNAs show such a correlation)11.

In those species for which the point mutation rate depends strongly on the sequence context of a nucleotide — for example, in mammals, which experience hypermutable CpG dinucleotides — the mutational model predicts a strong context dependence of codon usage, which has indeed been observed12. Thus, at the genomic scale, neutral processes that do not discriminate among synonymous mutations remain plausible for explaining interspecific variation in codon usage among higher eukaryotes and they are well accepted as the primary determinants of interspecific variation in most other taxa (but see Ref. 13).

Aside from mutation biases, adaptation of codon usage to cellular tRNA abundances can also influence synonymous sequence variation across species (see below), as codon usage and tRNA regulation can co-evolve. Finally, some neutral processes that are responsible for codon bias across taxa are not mutational per se. Even in the absence of selection at synonymous sites, selection at non-synonymous sites can induce differences in nucleotide composition between coding and non-coding regions5,14,15,16.

Patterns across a genome. There is often systematic variation in codon usage among the genes in a genome, usually attributed to selection. In organisms, including Escherichia coli, Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster and possibly also mammals (see below), there is a significant positive correlation between a gene's expression level and the degree of its codon bias, and a negative correlation between expression level and the rate of synonymous substitutions between divergent species9,17,18 — features that are difficult to explain through mutation alone. Although mutational effects could possibly co-vary with expression levels, because transcription can be mutagenic19,20, this effect is unlikely to account for the correlations between codon usage and expression levels that are observed in numerous species5,19,21.

The classic explanation for systematic variation across a genome is selectionist: codon bias is more extreme in highly expressed genes to match a skew in iso-accepting tRNAs and, thereby, provide a fitness advantage through increased translation efficiency or accuracy of protein synthesis9,17,22,23,24,25,26,27. There is strong evidence for this hypothesis in several species, mostly in the form of broad correspondences between the 'preferred codons' that are used in highly expressed genes and measures of relative tRNA abundances28,29,30,31,32. As a result, translational selection remains the dominant explanation for systematic variation in codon usage among genes, despite the fact that supporting evidence is sometimes incomplete: direct measurements of tRNA abundances are rare in higher eukaryotes; the correspondence of tRNA abundance with tRNA copy number5 is weak in D. melanogaster and humans33; and 30% of bacterial species show no evidence of translational selection34.

There are two possible directions of causality relating an endogenous gene's expression level and the degree of its codon adaptation35 to tRNA abundances. In one view2,36,37 high codon adaptation induces strong protein expression, because rapid and/or accurate elongation increases a given protein's rate of synthesis; in the other view, strong expression selects for high codon adaptation to avoid costs that scale with a gene's expression level. In the biotechnology literature, the former interpretation is de rigueur, whereas in the literature on molecular evolution, the latter interpretation prevails3,4,5. The idea that high codon adaptation induces high protein levels per mRNA molecule does not square well with the notion that initiation is generally rate limiting for endogenous protein production7,9,38,39 (although it may apply to heterologous genes (see below)). When initiation is limiting, the elongation rate should not influence the amount of protein that is produced from a given message7,9 (Fig. 2). Moreover, from an evolutionary perspective, if high protein levels are desirable, it would seem easier to tune a promoter for increased transcription than to select on hundreds of individual synonymous mutations, each of which has only a marginal effect on the overall amount of protein synthesis. Conversely, the use of poorly adapted codons to slow the translation of genes expressed at low levels36 would seem wasteful compared to simply reducing transcription or slowing initiation.

Figure 2: Relationships between initiation rate, elongation rate, ribosome density and the rate of protein synthesis for endogenous genes.

The steady-state rate of protein synthesis and density of ribosomes bound on an mRNA both depend on the rates of initiation and elongation. When elongation is the rate-limiting step in a gene's translation (case A), the message will be covered as densely as possible by ribosomes and faster elongation will tend to increase the rate of protein synthesis. However, most endogenous genes are believed to be initiation limited (cases B, C and D), so that their transcripts are not completely covered by ribosomes. This is evidenced by extensive variability in ribosome densities across endogenous mRNAs67. For two initiation-limited genes with the same initiation rate, the mRNA with faster elongation (afforded by, say, higher codon adaptation to tRNA pools) will have a lower density of translating ribosomes (C versus B) but no greater rate of termination. Thus, when initiation is limiting, high codon adaptation should not be expected to increase the amount of protein that is produced per mRNA molecule (protein amounts are the same in B and C). A lower density of ribosomes can also occur when two initiation-limited genes have the same elongation rate, but one has a slower initiation rate (D versus C). In this case, the amount of protein that is produced will be lower for the mRNA that has the slower initiation rate (D). The extent to which variation in ribosome densities67 arises from variation in initiation versus elongation rates remains to be determined. In all cases shown here, as is true for most endogenous genes, the gene's mRNA does not account for a substantial proportion of total cellular mRNA, so that the rates of initiation and elongation do not substantially alter the pool of free ribosomes (compare with Fig. 4).
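The four cases in this caption can be reproduced with a deliberately crude steady-state sketch. It assumes that each ribosome occupies a fixed `footprint` of codons (10 here, an assumed value), that the flux of ribosomes saturates at `elongation_rate / footprint` when the message is fully covered, and that ribosomes otherwise do not interfere with each other. These assumptions and all parameter values are illustrative and are not taken from the source; a full treatment would use a stochastic exclusion model.

```python
def steady_state(initiation_rate, elongation_rate, footprint=10.0):
    """Toy steady state for one mRNA.

    initiation_rate : attempted initiations per second
    elongation_rate : codons per second
    footprint       : codons occupied by one ribosome (assumed value)

    Returns (synthesis_rate, coverage), where synthesis_rate is proteins
    per second per mRNA and coverage is the fraction of the message
    occupied by ribosomes.
    """
    # Largest flux the message can carry when fully packed with ribosomes.
    max_flux = elongation_rate / footprint
    synthesis_rate = min(initiation_rate, max_flux)
    # Ribosomes per codon = flux / speed; multiply by footprint for coverage.
    coverage = synthesis_rate * footprint / elongation_rate
    return synthesis_rate, coverage


# Hypothetical parameter choices mirroring cases A to D of the figure.
case_a = steady_state(initiation_rate=5.0, elongation_rate=10.0)  # elongation limited
case_b = steady_state(initiation_rate=0.2, elongation_rate=5.0)   # slow elongation
case_c = steady_state(initiation_rate=0.2, elongation_rate=10.0)  # fast elongation
case_d = steady_state(initiation_rate=0.1, elongation_rate=10.0)  # slow initiation
```

In this sketch, cases B and C have the same synthesis rate but C has the lower coverage, and case D has both lower coverage and a lower synthesis rate than C, which is the qualitative pattern the caption describes.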

Although evolutionary studies generally agree that high expression selects for high codon adaptation in endogenous genes (as opposed to the converse), the precise nature of fitness gains associated with translationally adapted codons remains a topic of active debate (Box 1). Furthermore, even though translational efficiency is energetically beneficial to the cell, efficient translation generally increases the amount of cell-to-cell variation in expression levels40 and this noise is typically deleterious41. Although translational selection has received the most attention, systematic variation in codon usage across a genome can also be caused by neutral processes in certain species; these processes include horizontal gene transfer42, different nucleotide bias in leading and lagging strands of replication in bacteria43 and isochore structure in mammals (Box 2).

Patterns across a gene. Codon usage can vary dramatically even within a single gene. Synonymous mutations at specific sites may experience selection because they disrupt motifs that are recognized by transcriptional or by post-transcriptional regulatory mechanisms, for example, microRNAs. Sites that require ribosomal pausing for proper co-translational protein folding or ubiquitin modification44 may experience selection for poorly adapted codons45 or strong mRNA folding46. Codon choice that promotes proper nucleosome positioning is selectively advantageous in eukaryotes, especially in 5′ regions47. And, finally, in mammals, synonymous mutations near an intron–exon boundary can create spurious splice sites or disrupt splicing control elements4,48, causing disease4. This phenomenon helps to explain the reduced rate of synonymous substitutions and SNP density near splicing control elements49,50. Selection for proper splicing also extends to D. melanogaster, and sequence variation suggests that it is probably an even stronger force than translational selection in shaping codon usage near intron–exon boundaries51.

Although important, the mechanisms of intragenic codon-usage variation described above are typically restricted to specific taxa or special classes of sites. Recent studies have argued for three mechanisms that produce systematic variation in codon usage across the sites in a gene in a diverse range of species.

One of these mechanisms is selection against strong 5′ mRNA structure to facilitate translation initiation. mRNA structure near the 5′ end of a coding region is generally disadvantageous9 as it can inhibit ribosomal initiation52,53 (Fig. 3a). Eyre-Walker and Bulmer proposed selection against mRNA structure to explain a trend towards reduced codon adaptation in the 5′ region of E. coli genes and a corresponding reduced rate of synonymous substitutions across divergent species54. More recently, following similar observations in E. coli55, Gu et al. demonstrated a broad trend in all sequenced prokaryotes and eukaryotes towards reduced mRNA stability near the translation initiation sites of genes, especially for GC-rich genes56. This study relied on computational predictions of mRNA structure in short windows; combined with large-scale experimental studies (see below), this work suggests a systematic role for selection on mRNA structure in shaping codon usage in the first 30–60 nucleotides of genes.

Figure 3: Effects of mRNA secondary structure on translation initiation in bacteria.

a | Structure in the ribosome binding site (RBS) usually inhibits initiation. However, initiation can occur when the structured element is positioned between the Shine–Dalgarno sequence (SD) and the start codon (AUG)129, or 15 nucleotides downstream of the start codon130,131. b | Synonymous mutations in the region from nucleotide −4 to +37 of a GFP gene alter the predicted folding energies by up to 12 kcal mol−1. A 5′ mRNA folding energy of below −10 kcal mol−1 strongly inhibits GFP expression in Escherichia coli55. c | More than 40% of human genes have predicted 5′ folding energies below the −10 kcal mol−1 threshold and are therefore expected to express poorly in E. coli without modification. AU, arbitrary units. Part b is modified, with permission, from Ref. 55 © (2009) American Association for the Advancement of Science.

Tuller et al.57 recently described a second, systematic trend in the pattern of intragenic codon usage: a 'ramp' of poorly adapted codons in the first 90–150 nucleotides of genes, which had earlier been observed in bacteria, yeast and fruitflies58,59. This pattern has been preserved across divergent species even when tRNA pools (estimated from gene copy numbers) have changed57. A ramp of poorly adapted codons presumably slows elongation at the start of a gene, which may provide several physiological benefits. Slow 5′ elongation is predicted to reduce the frequency of ribosomal traffic jams towards the 3′ end57,60, thus reducing the cost of wasted ribosomes and of spontaneous or collision-induced abortions. Alternatively, a ramp of slow elongation may facilitate recruitment of chaperone proteins to the emergent peptide61. Other explanations, unrelated to elongation rate, are also plausible such as weaker selection for accurate translation near the start of a gene, where missense and nonsense errors would be less costly24,59. The earliest interpretation of unusual 5′ codon usage posited selection to increase the initiation rate9; interestingly, the 5′ region of poorly adapted codons identified by Tuller et al. overlaps significantly with the region in which synonymous codon choice systematically reduces mRNA stability54,55,56,58. It remains unclear which selective mechanisms are primarily responsible for the unusual and nearly universal pattern of 5′ codon usage. Multiple mechanisms may certainly operate in different genes; however, it is unclear why a single gene should experience selection both to increase its rate of ribosomal initiation9 and to reduce the subsequent rate of its early elongation57.

Cannarozzi et al.62 recently exposed a third, novel pattern of intragenic codon usage in eukaryotes: the re-use or autocorrelation of codons across a gene sequence, driven, they argue, to improve elongation efficiency through tRNA recycling. If a recently used tRNA molecule is bound to the ribosome, or if it diffuses slowly compared to ribosomal progression and re-acylation63, then it would be efficient to re-use the same tRNA molecule for subsequent incorporations of the same amino acid. This physical model predicts selection for using the same codon or, more generally, a codon that is read by the same tRNA species, at nearby sites in a gene that encode the same amino acid. Indeed, Cannarozzi et al. observed significant autocorrelation of codons across gene sequences in most eukaryotes, especially in genes that are rapidly upregulated in response to stress. Of course, autocorrelation would also be predicted if all sites in a gene independently experience pressure for biased codon usage, for example, to match the global pool of tRNAs. To control for overall codon usage, Cannarozzi et al. compared the degree of autocorrelation in actual gene sequences to gene sequences that had been reshuffled at random, finding more autocorrelation on average in the unshuffled genes, although only marginally so. More convincingly, they observed that autocorrelation is strongest for iso-accepting codons of rare tRNAs in highly expressed genes, which is predicted by the tRNA-recycling hypothesis but not by a selective pressure that applies at all sites independently.
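The shuffling control described above can be sketched as follows. The statistic used here, the fraction of consecutive occurrences of an amino acid that re-use the same codon, is a simplified stand-in and not the measure used by Cannarozzi et al.; it also ignores the fact that one tRNA species can read several codons. The null model shuffles codons only among positions that encode the same amino acid, so the protein sequence and the overall codon usage of the gene are preserved. All function names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(seq):
    """Split an in-frame sequence into recognised codons."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return [c for c in codons if c in CODON_TABLE]


def reuse_fraction(codons):
    """Fraction of consecutive same-amino-acid sites that re-use a codon."""
    last_codon = {}
    same = pairs = 0
    for c in codons:
        aa = CODON_TABLE[c]
        if aa in "MW*":  # single-codon families and stops carry no signal
            continue
        if aa in last_codon:
            pairs += 1
            same += last_codon[aa] == c
        last_codon[aa] = c
    return same / pairs if pairs else float("nan")


def shuffle_within_amino_acids(codons, rng):
    """Permute codons among sites encoding the same amino acid."""
    positions = defaultdict(list)
    for i, c in enumerate(codons):
        positions[CODON_TABLE[c]].append(i)
    shuffled = list(codons)
    for idx in positions.values():
        pool = [codons[i] for i in idx]
        rng.shuffle(pool)
        for i, c in zip(idx, pool):
            shuffled[i] = c
    return shuffled


def autocorrelation_excess(seq, n_shuffles=1000, seed=0):
    """Return (observed, mean over shuffles) of the re-use fraction."""
    rng = random.Random(seed)
    codons = split_codons(seq)
    observed = reuse_fraction(codons)
    null = [reuse_fraction(shuffle_within_amino_acids(codons, rng))
            for _ in range(n_shuffles)]
    return observed, sum(null) / len(null)
```

An observed value above the shuffled mean indicates that identical codons are clustered more than expected from the gene's overall codon usage alone. A real analysis would also need a significance estimate and a grouping of codons by the tRNA that reads them.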

Measurements of endogenous expression

Recent developments in mass spectrometry and fluorescence microscopy allow large-scale measurements of endogenous protein levels64,65,66. Together with techniques for quantifying ribosomal occupancy67 and measuring elongation dynamics68, these advances provide a spectacularly detailed account of basic cellular processes, with implications for our understanding of codon biases.

Variation in protein/mRNA ratios. Shotgun proteomics has revealed an extensive role for post-transcriptional processes in determining eventual protein levels in bacteria, yeast69, worms70, fruitflies70 and especially mammals65,66. Whereas the imperfect correlations between protein and mRNA levels (R² ≈ 47–77% in E. coli65,71, 73% in yeast65 and 29% in humans66) may previously have been seen as measurement noise, researchers have since attributed much of the variation in protein/mRNA ratios to sequence-derived characteristics of genes. In a recent study in human cells66, the strongest correlates of steady-state protein levels, controlling for mRNA levels, were coding-sequence length (reflecting the fact that longer transcripts are less stable72 or slower to initiate73), amino acid content (reflecting the variable costs associated with synthesizing different amino acids, or variable rates of protein degradation) and predicted 5′ mRNA structure (reflecting lower initiation rates when the 5′ structure is strong). Importantly, the codon adaptation index35, which correlates strongly with mRNA levels in yeast65 and weakly in humans66, shows little or no significant correlation with the amount of protein per mRNA molecule in either organism65,66; this suggests that codon adaptation does not significantly increase the protein yield from a given message, at least among endogenous genes74,75. It is important to note that steady-state protein levels are influenced by both protein production and protein degradation, so any variation in degradation rates unrelated to codon usage will further reduce the correlation between codon usage and protein/mRNA ratios.

Ribosomal footprints. Ingolia et al. recently devised a clever application of RNA–seq to quantify ribosome-protected RNA fragments in a cell, thereby estimating 'ribosomal footprints' across the transcriptome67. This method has provided rich information about translational regulation and it has uncovered some startling phenomena, such as an abundance of upstream ORFs with non-AUG start codons. The footprint data in yeast show a greater mean density of ribosomes in the first 100–150 codons of genes, suggesting locally slow elongation; this is consistent with the observed presence of poorly adapted codons in the 5′ region58,59. There is also a significant negative correlation, genome-wide, between a transcript's ribosome density and the experimentally measured strength of mRNA structure near its start site76 suggesting that strong 5′ mRNA structure retards translational initiation and reduces the density of translating ribosomes.

Remarkably, on averaging data from all yeast genes, Tuller et al.37 also observed a negative correlation between predicted mRNA folding energy and ribosome density among the first 65 codons, suggesting that strong mRNA structure downstream of the start site retards translational elongation. This observation is surprising, given the helicase activity of translating ribosomes77; however, the correlation between the genome-wide average profiles of mRNA folding and ribosome density does not imply a correlation at the level of individual sites. Ingolia et al. also measured ribosomal footprints under amino acid starvation and found that one-third of yeast genes showed substantially increased or decreased translational efficiency in these conditions compared with controls67. A detailed parsing of the relationship between a gene's amino acid content and translational response to starvation may improve design principles for overexpressed heterologous genes, which often induce starvation78,79 (see below).

Translational efficiency. Notions of translational efficiency differ in the literature on gene expression. Ingolia et al.67 defined the translational efficiency of a gene as the number of bound ribosomes per mRNA molecule; by contrast, Tuller et al.37,57 and others defined efficiency as protein yield per mRNA molecule (that is, the ratio of protein abundance to mRNA abundance). The second definition is more relevant to issues of total protein synthesis, whereas the former definition may be more relevant to ribosomal availability and overall cellular fitness. These two notions of translational efficiency are only weakly correlated for endogenous genes (R² < 2.5% comparing the data by Ingolia et al.67 to Ref. 80), indicating that the density of ribosomes on a given mRNA molecule does not determine the amount of protein that is produced from it. Similarly, in yeast, a gene's codon adaptation index35 explains less than 3% of the variance in protein abundance per mRNA67. Both of these observations are consistent with the idea that, for most endogenous genes, initiation is rate limiting for protein production38,39 and therefore determines the amount of protein produced from each message, regardless of ribosome density or codon adaptation7,9 (Fig. 2); however, this logic may not apply to overexpressed heterologous genes, which are described in the following section (Fig. 4).
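The two definitions of translational efficiency, and the R² that compares them, can be written down in a few lines. This is a minimal sketch that assumes per-gene measurements are supplied as equal-length lists of positive numbers and that ratios are compared on a log scale; the log transform and the function names are assumptions made for illustration, not details taken from the cited studies.

```python
import math


def efficiencies(footprints, mrna, protein):
    """Return the two per-gene efficiency measures on a log10 scale.

    ribosome_te : bound ribosomes per mRNA (footprint / mRNA abundance)
    protein_te  : protein yield per mRNA (protein / mRNA abundance)
    """
    ribosome_te = [math.log10(f / m) for f, m in zip(footprints, mrna)]
    protein_te = [math.log10(p / m) for p, m in zip(protein, mrna)]
    return ribosome_te, protein_te


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)
```

Calling `r_squared(*efficiencies(footprints, mrna, protein))` on matched measurements gives the fraction of variance shared by the two measures; a value near zero, as reported above, means that ribosome density per message says little about protein yield per message.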

Figure 4: The elongation rate may influence the rate of protein synthesis for an overexpressed gene.

Unlike most endogenous genes, mRNA from an overexpressed transgene may account for a substantial proportion of total cellular mRNA. In this case, slow elongation (caused by poor codon adaptation to charged tRNA pools, say) can increase the density of bound ribosomes and thereby reduce the pool of available ribosomes in the cell. Such a depletion of available ribosomes will feed back to reduce the initiation rate of subsequent translating ribosomes on the message, thereby reducing the rate of protein synthesis. This is illustrated schematically by comparing overexpressed mRNAs with slow elongation (above) and rapid elongation (below), but identical initiation sequences. Thus, the relationship between codon adaptation and the rate of protein synthesis per mRNA molecule may differ for an overexpressed transgene compared to an endogenous gene (Fig. 2).
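The feedback described in this caption can be illustrated with a toy mass-balance model. It assumes that the initiation rate on each transgene mRNA is proportional to the number of free ribosomes (rate constant `k_init`), that each initiating ribosome stays bound for `length_codons / elongation_rate` seconds, and that only the transgene competes for ribosomes. At steady state the free pool is then R_free = R_total / (1 + N · k_init · L / v). Every symbol and parameter value here is hypothetical, and the model ignores ribosome collisions and all endogenous messages.

```python
def synthesis_per_mrna(total_ribosomes, n_mrna, length_codons,
                       elongation_rate, k_init):
    """Toy steady state for n_mrna identical transgene messages.

    Returns (rate, free), where rate is proteins per second per mRNA and
    free is the number of ribosomes not bound to the transgene.
    """
    # Ribosomes sequestered per free ribosome, summed over all messages.
    load = n_mrna * k_init * length_codons / elongation_rate
    free = total_ribosomes / (1.0 + load)
    rate = k_init * free
    return rate, free


def slow_to_fast_ratio(n_mrna):
    """Protein yield per mRNA with slow codons relative to fast codons."""
    common = dict(total_ribosomes=10_000, n_mrna=n_mrna,
                  length_codons=300, k_init=1e-5)  # hypothetical values
    slow, _ = synthesis_per_mrna(elongation_rate=5.0, **common)
    fast, _ = synthesis_per_mrna(elongation_rate=10.0, **common)
    return slow / fast
```

With these made-up parameters, halving the elongation rate barely changes the yield per message when a single copy is present, but reduces it substantially when the message is present in 10,000 copies, which is the contrast between Fig. 2 and Fig. 4.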

Measurements of heterologous expression

Codon bias has a crucial role in heterologous gene expression. However, there is often a disconnection between technological and evolutionary studies of codon bias, a gap that partly reflects genuine differences between endogenous and heterologous situations. In many biotechnological applications, a transgene is massively overexpressed, accounting for up to 30% of the protein mass in the cell. As a result, the principles that relate heterologous codon usage to protein levels may differ substantially from the endogenous case.

The idea that initiation generally limits translation may not apply to an overexpressed transgene whose mRNA accounts for a large proportion of total cellular mRNA. In such a case, inefficient use of ribosomes along the overexpressed mRNA may be sufficient to feed back and significantly deplete available ribosomes, thereby reducing initiation rates and retarding further heterologous protein production9 (Fig. 4). Thus, we might expect that the elongation effects of codon usage will influence protein yields per mRNA molecule for overexpressed genes. Nonetheless, we should not necessarily expect that the codons that are adapted to efficient elongation for endogenous genes will correspond to the efficient codons for heterologous genes, because overexpression causes amino acid starvation and concomitant alterations in the abundances of charged tRNAs78,79,81. Indeed, there was no significant correlation between codon adaptation35 and expression levels in two large-scale systematic experiments55,79. In fact, even endogenous genes that are essential during amino acid starvation, such as amino acid biosynthetic enzymes, preferentially use codons that are poorly adapted to the typical pool of charged tRNAs, but are well adapted to starvation-induced tRNA pools78,79.

Despite the complications described above, the field of codon optimization has traditionally focused on adjusting codon usage to match cellular tRNA abundances in standard conditions, disregarding other dimensions of bias. However, strategies are now changing. Several recent studies advocate for the role of global nucleotide content82,83, local mRNA folding55,84, codon pair bias85, a codon ramp57 or codon correlations62 in optimizing heterologous expression (Table 1).

Table 1 Coding-sequence covariates of gene expression and other sources of codon bias that are unrelated to gene expression

Effects of codon adaptation on expression levels. Many studies show strong effects of rare codons on heterologous expression. In E. coli, stretches of rare AGA or AGG codons cause ribosome pausing and co-translational cleavage of mRNA86, ribosomal frameshifting87 or amino acid misincorporation88. Consistent with theoretical expectations, codons that are read by rare tRNAs can slow elongation by several fold89. And stretches of AGG codons near the ribosome-binding site (RBS) can reduce protein yields by obstructing translation initiation90. Although such studies are convincing, they usually address the effect of a subset of rare codons, often in long stretches, in E. coli cells; it is not known whether these principles can be applied in general.

Observations such as those above were quickly followed by efforts to adjust the global codon adaptation of transgenes to cellular tRNA abundances. Several approaches have been proposed: 'CAI maximization' replaces all codons by the most preferred codons in the target genome, but this could result in unbalanced charged tRNA pools2; 'codon harmonization'91 puts some non-preferred codons in positions that correspond to predicted protein domain boundaries; and 'codon sampling' adjusts the codon usage to reflect the overall usage in the target genome. In the absence of tRNA abundance estimates, codon frequencies in the target genome are sometimes used. It has also been suggested that codon usage should match the profile of charged tRNAs rather than total tRNAs79,81. The utility of codon adaptation approaches is still unclear, as they have not been systematically compared against each other, and several anecdotal studies argue both for (for example, Ref. 92) and against (for example, Ref. 93) their efficiency.
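Two of the strategies named above, 'CAI maximization' and 'codon sampling', differ only in how a codon is chosen for each amino acid, as the following minimal sketch shows. It assumes the caller supplies `usage`, a mapping from one-letter amino acid codes to the codon frequencies of the target genome. The small `TOY_USAGE` table is invented purely to exercise the functions and does not describe any real organism; 'codon harmonization' is not covered because it requires predicted domain boundaries.

```python
import random

# Invented frequencies for illustration only; not from any real genome.
TOY_USAGE = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.75, "AAG": 0.25},
    "E": {"GAA": 0.7, "GAG": 0.3},
}


def cai_maximize(protein, usage):
    """Encode each amino acid with its most frequent codon."""
    return "".join(max(usage[aa], key=usage[aa].get) for aa in protein)


def codon_sample(protein, usage, seed=0):
    """Draw each codon at random in proportion to its frequency."""
    rng = random.Random(seed)
    chosen = []
    for aa in protein:
        codons = list(usage[aa])
        weights = [usage[aa][c] for c in codons]
        chosen.append(rng.choices(codons, weights=weights, k=1)[0])
    return "".join(chosen)
```

Here `cai_maximize("MKE", TOY_USAGE)` returns `"ATGAAAGAA"`, whereas `codon_sample` returns a sequence whose codon frequencies approach those in the table as the protein gets longer. Neither function considers mRNA folding, nucleotide content or charged tRNA pools, which the surrounding text identifies as further determinants of expression.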

Codon adaptation algorithms typically optimize many sequence properties at once. This makes it difficult to determine which parameter causes observed differences in expression. In two recent multi-gene studies, between 60% and 70% of genes experienced increased expression upon codon optimization94,95, but whether this was a direct consequence of increased codon adaptation or other sequence properties is unclear. In our study of 154 synonymous variants of GFP, we observed no significant correlation between the codon adaptation index35 and expression levels in E. coli55, but a weak positive correlation was later found using nonlinear regressions37,96. In any case, adaptation of codon usage is limited to species with pronounced and well-understood variation in tRNA concentrations, such as bacteria and yeast.

Effects of nucleotide bias on expression levels. Nucleotide biases are pervasive in natural genes and have the potential to alter the interactions of mRNA with DNA, with proteins and with itself, thereby influencing RNA production, degradation and translation rates. Many of these effects are characterized, but this knowledge has yet to find its way into standard codon optimization procedures.

GC-rich mRNAs can form strong secondary structures, and, in bacteria, strong structure near the RBS prohibits initiation53,55,97 (Fig. 3b). As a result, more than 40% of human genes would be expected to express poorly when placed in E. coli without modification (Fig. 3c). Strong structure near the start codon reduces heterologous expression in yeast as well (G.K., unpublished observations), consistent with evolutionary analyses56. No such effect has been described in mammals; on the contrary, high GC content generally increases expression levels in mammalian cells (see below). However, a strong mRNA hairpin in the coding sequence has been reported to interfere with translation in mammalian cells84 and strong hybrids between RNA and DNA (the R-loops) may interfere with transcription98.

GC-poor mRNAs are unlikely to fold strongly, but they often carry other sequence elements that limit expression. For example, low GC content is commonly believed to limit the expression of Plasmodium falciparum genes in E. coli, although the mechanisms are unknown. Such mRNAs may be targets for RNase E, which cleaves AU-rich sequences with low sequence specificity99. The situation is slightly clearer in mammals, in which low GC content (or high A content) has been shown to reduce expression82,83. This effect is common knowledge in virology, as HIV and human papilloma virus (HPV) genes are poorly expressed in human cells unless the gene sequences are optimized100,101. The rate-limiting step in these cases may be transcription or nuclear RNA export82,83, which is consistent with the efficient expression of GC-poor genes in cytoplasmic transcription systems based on the vaccinia virus101.
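As a rough aid for screening transgenes for the nucleotide biases discussed above, the following Python sketch computes overall GC content and GC content in sliding windows. It is a crude proxy only: it does not predict mRNA folding energy near the RBS, which would require a dedicated RNA structure program. The function names, window size and example sequence are hypothetical.

```python
def gc_content(seq):
    """Fraction of G or C bases in a DNA or RNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def sliding_gc(seq, window, step=1):
    """GC content of each full-length window along the sequence."""
    return [
        gc_content(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    example = "ATGGCGCCGGGCAAATTTAAATATTTAAAT"  # made-up sequence
    print(gc_content(example))
    print(sliding_gc(example, window=10, step=10))
```

Windowed values can flag locally GC-rich stretches near the start codon or extended AU-rich stretches further downstream that a single genome-wide average would hide.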

Little is known about the functional consequences of replication strand-related bias or CTAG avoidance, which are common in prokaryotes. High CpG content was reported to correlate with high expression in mammalian cells102, possibly by altering the distribution of nucleosomes on DNA.
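One widely used way to quantify CpG content is the observed-to-expected ratio: the number of CG dinucleotides multiplied by the sequence length and divided by the product of the C and G counts. The Python sketch below implements this ratio as an illustration. It is not necessarily the measure used in the study cited above, and the function name and example sequence are hypothetical.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        raise ValueError("ratio is undefined without both C and G")
    # "CG" cannot overlap with itself, so str.count finds every occurrence.
    return seq.count("CG") * len(seq) / (c_count * g_count)


if __name__ == "__main__":
    print(cpg_obs_exp("ATGCGTACGCTAGCCGGA"))  # made-up sequence
```

A ratio near 1 indicates that CG dinucleotides occur about as often as expected from the base composition, whereas values well below 1 indicate CpG depletion.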

Other effects of synonymous mutations on expression levels. Other examples of synonymous mutations influencing expression have been described, primarily as anecdotal observations. In E. coli, overrepresented codon pairs103 were proposed to decrease translation elongation rates104, although this conclusion was later disputed105. In an attempt to produce attenuated strains, Coleman et al.85 partially de-optimized codon pairs in the poliovirus genome and observed a reduction in protein yield of several fold and a reduction in viral infectivity of 1,000-fold in mammalian cells. In yeast, a version of GFP with autocorrelated codon usage showed 30% lower ribosome density than a version with anticorrelated codon usage, suggesting faster elongation62. And a synonymous mutation in the human multidrug-resistance protein 1 (MDR1) gene was proposed to influence mRNA stability106 or protein folding and substrate specificity107. These observations are all intriguing and form important avenues for future systematic studies to determine their molecular bases.

Conclusions

Recent years have begun to see a convergence of experimental work on endogenous and heterologous gene expression, as both types of studies take advantage of high-throughput, quantitative techniques. Heterologous studies using large libraries of random or unbiased synonymous sequence variation55,81,97 are especially important for uncovering and comparing general rules to optimize expression. By contrast, relatively small-scale studies based on preconceived notions of 'optimized' codon usage do not provide sufficient power to distinguish among alternative mechanisms, nor do they allow us to discover any new mechanisms that increase expression. Heterologous studies will be complemented by endogenous measurements of initiation and elongation dynamics, and their effects on protein synthesis as a function of a gene's amino acid content and transcript level.

In the short term, there will be a trade-off between gaining predictive power for transgene optimization and deducing the underlying mechanisms that link codon usage and gene expression. High-dimensional, statistical regressions applied to large libraries of synonymous genes81,96 provide a principled, effective means of increasing heterologous expression. Such techniques are increasingly valuable in applied contexts in which high expression is required — such as viral-delivered gene therapies108,109 — but they do not generally identify molecular mechanisms. Our hope, over the long term, is that cross-fertilization between biotechnological and molecular biological studies will elucidate effective strategies for designing transgenes, as well as the mechanistic principles that underlie their expression.