Introduction

The expression of functional proteins in heterologous hosts is a cornerstone of modern biopharming. However, many human proteins are difficult to express in unicellular organisms such as bacteria and yeast. The underlying problem is that codon bias has a profound impact on the heterologous expression of human proteins in these organisms. Codon usage has been found to be the single most important factor in prokaryotic gene expression. Therefore, laborious and time-consuming codon optimization is often necessary to achieve successful expression in unicellular organisms (Gustafsson et al. 2004). In contrast to unicellular organisms, no obvious codon bias has been observed in humans and several other mammals in previous studies. Thus the animal mammary gland is considered an ideal bioreactor for producing functional human proteins without codon optimization. With this concept in mind, a number of transgenic livestock have been created to produce different recombinant human proteins in their milk, and in 2006 a recombinant human protein purified from a transgenic goat was approved for clinical use in Europe by the European Medicines Agency (Houdebine 2009).

However, considerable variation in expression efficiency has been found in the heterologous expression of human proteins in the milk of transgenic animals produced in our lab and by other groups. Generally, a number of proteins such as serpin peptidase inhibitors and immunoglobulins, which are abundant in human tissues other than the mammary gland, tend to exhibit a higher expression level in transgenic milk, while many other human proteins such as interleukin-2, coagulation factor VIII and catalase are difficult to express in milk (Wright et al. 1991; Tang et al. 2008; Buhler et al. 1990; Niemann et al. 1999; He et al. 2008a, b). These examples lead us to ask whether tissue specific codon usage patterns affect translational regulation during the heterologous expression of human proteins in the animal mammary gland.

Recently, several independent studies concluded that codon bias might be a factor involved in translational regulation in humans. One study concluded that genes selectively expressed in one human tissue can often be discriminated from genes expressed in another tissue on the basis of their synonymous codon usage (Plotkin et al. 2004), while another study reported, based on microarray results, that the amount of tRNA varies widely among different human tissues; furthermore, it showed that relative tRNA abundance correlates significantly with the codon usage of tissue specific genes (Dittmar et al. 2006). The effective number of codons (ENC) is one of the most common indices for measuring codon bias. ENC is analogous to the effective number of alleles in population genetics. However, ENC cannot reveal which codons are more frequent than others; it indicates only the overall departure from random synonymous codon choice. As a result, two genes may exhibit the same degree of overall bias but differ dramatically in their particular choice of synonymous codons. Thus, in this study, we used a two-tailed Fisher exact test to measure the distance in synonymous codon usage between two genes (Plotkin et al. 2004). Unlike metrics such as "relative synonymous codon usage", which are noisy when applied to individual genes, a codon usage measure based on the Fisher exact test is suited to small sample sizes and can be applied to genes that contain only a few examples of each amino acid.
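To make the limitation of ENC concrete, the following is a minimal Python sketch of the commonly used ENC formulation usually attributed to Wright (1990). It is not the code used in this study. The function name `effective_number_of_codons`, the handling of missing degeneracy classes (an error is raised) and the capping at 61 are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def effective_number_of_codons(cds):
    """Return ENC for an in-frame coding sequence (illustrative sketch)."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))

    # Group sense codons by the amino acid they encode.
    families = defaultdict(list)
    for codon, aa in CODON_TO_AA.items():
        if aa != "*":
            families[aa].append(codon)

    # Codon homozygosity F per amino acid, pooled by degeneracy class.
    homozygosity = defaultdict(list)
    for codons in families.values():
        k = len(codons)
        n = sum(counts[c] for c in codons)
        if k == 1 or n < 2:
            continue  # Met/Trp, or too few observations
        sum_p2 = sum((counts[c] / n) ** 2 for c in codons)
        f = (n * sum_p2 - 1) / (n - 1)
        if f > 0:
            homozygosity[k].append(f)

    # Number of amino acids in each degeneracy class: 9, 1, 5 and 3.
    class_sizes = {2: 9, 3: 1, 4: 5, 6: 3}
    enc = 2.0  # Met and Trp contribute one codon each
    for k, size in class_sizes.items():
        if not homozygosity[k]:
            raise ValueError(f"no usable {k}-fold degenerate amino acid")
        enc += size / (sum(homozygosity[k]) / len(homozygosity[k]))
    return min(enc, 61.0)  # assumption: cap at the number of sense codons
```

Two genes that each use a single codon per amino acid both score 20 under this sketch, even if they prefer entirely different codons, which is the reason a pairwise distance measure is used instead.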

We propose two scenarios: first, the different expression levels of recombinant human proteins in the milk of transgenic animals are due to variation in synonymous codon usage patterns between the mammary gland and other human tissues; or, second, protein tertiary structure influences mammary transgene expression.

Human and mouse tissue specific codon usage

We made pairwise comparisons of codon usage among seven human tissues. When comparing heart to kidney (Fig. 1), virtually all kidney associated genes cluster in a separate middle clade from the heart associated genes. The observed separation between these two classes of genes would not have occurred by random chance (P < 0.001); the clustering is the result of systematic differential codon usage between heart and kidney specific genes. Fig. 1 indicates that we can generally discriminate between heart and kidney expressed genes on the basis of their codon usage alone. Similarly, kidney specific genes can be discriminated from lung and pancreas specific genes (supplementary Fig. 1). However, many pairs of tissue specific gene sets do not exhibit significantly different codon usage patterns (e.g., brain versus pancreas, P = 0.384; supplementary Fig. 2). Unexpectedly, among the six mouse tissues tested, no pair of tissues could be separated from each other on the basis of codon usage at a statistically significant level. Only heart specific genes could nearly be discriminated from liver specific genes (P = 0.072; supplementary Fig. 3).

Fig. 1

A dendrogram reflecting the codon usage of 22 genes selectively expressed in human heart (red) and 17 genes selectively expressed in human kidney (black). The pairwise distances underlying this tree reflect the degree to which the genes differ in their codon usage. As this tree demonstrates, heart-specific genes can generally be distinguished from kidney-specific genes purely on the basis of their synonymous codon usage. The observed separation between these two classes of genes would not have occurred by random chance (P < 0.001)

When comparing human mammary gland to six other tissues, only heart, lung and pancreas specific genes could be discriminated from mammary gland specific genes on the basis of codon usage (heart versus mammary gland, P < 0.001; lung versus mammary gland, P = 0.002; pancreas versus mammary gland, P < 0.001; supplementary Fig. 4), whereas the other three tissues tested could not be distinguished from mammary gland (brain versus mammary gland, P = 0.368; kidney versus mammary gland, P = 0.368; liver versus mammary gland, P = 0.536; supplementary Fig. 5). Among the six mouse tissues tested, only genes selectively expressed in heart and pancreas could be distinguished from mouse mammary gland specific genes (heart versus mammary gland, P = 0.002; pancreas versus mammary gland, P = 0.038; supplementary Fig. 6). Thus there do appear to be codon usage differences between some mammalian tissues.
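The P values quoted in this section come from a label permutation test. A minimal Python sketch of such a test is given here under one loudly stated simplification: it sums the raw pairwise distances between genes of the same tissue, whereas the study sums distances along the neighbor joining tree. The function names, the default of 999 permutations and the fixed seed are illustrative assumptions.

```python
import random


def within_group_distance(dist, labels):
    """Sum of pairwise distances between items sharing a label.

    Assumption: raw matrix distances stand in for the path lengths
    along the tree that the study uses.
    """
    n = len(labels)
    return sum(
        dist[i][j]
        for i in range(n)
        for j in range(i + 1, n)
        if labels[i] == labels[j]
    )


def permutation_p_value(dist, labels, n_perm=999, seed=0):
    """One-sided P value: are same-label genes closer than expected by chance?"""
    rng = random.Random(seed)
    observed = within_group_distance(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomly reassign tissue labels to genes
        if within_group_distance(dist, shuffled) <= observed:
            hits += 1
    # Count the observed labelling as one permutation to avoid P = 0.
    return (hits + 1) / (n_perm + 1)
```

With 999 permutations the smallest attainable value is 0.001, so a result at that floor would be reported as an upper bound rather than an exact probability.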

Do expression levels of recombinant human proteins in transgenic milk correlate with mammary gland specific codon usage?

Successful examples of recombinant human protein expression in transgenic animals to date are summarized in Table 1, and the highest expression level reported for each of the 31 recombinant proteins in Table 1 is shown in Fig. 2. When comparing the codon usage of the 31 recombinant proteins expressed in the milk of transgenic animals to that of human mammary gland specific genes and milk proteins, we found that several of the most efficiently expressed recombinant human proteins (SERPINC1, ATCD20IgL, REG3A, LTF, FGA, FGB and FGG) clustered close to the mammary gland specific genes and milk proteins. However, we also found that the moderately expressed FIX and the less efficiently expressed recombinant proteins (LYZ, CAT, IL2 and mCol18a1) clustered close to most human mammary gland specific genes and milk proteins. The three classes of genes could not be discriminated from each other (mammary gland, P = 0.574; milk proteins, P = 0.900; supplementary Fig. 7). Similar results were obtained when comparing the recombinant proteins to mouse mammary gland specific genes and milk proteins (mammary gland, P = 0.952; milk proteins, P = 0.974; supplementary Fig. 8). A further comparison of the codon usage of milk proteins among 19 different mammals showed that most milk proteins tend to use similar codon usage patterns across mammalian species (supplementary Fig. 9). We therefore compared the codon usage of the recombinant proteins to that of milk proteins in the five main livestock species: cow, sheep, goat, rabbit and pig. In each animal, except for certain proteins, the clustering result was quite similar to that for human and mouse (Fig. 3). Thus we dismiss our proposal that expression levels of recombinant human proteins in the milk of transgenic animals correlate with mammary gland specific codon usage patterns.

Table 1 Expression of recombinant human proteins in the milk of transgenic animals
Fig. 2

Expression levels of recombinant human proteins in the milk of transgenic animals. The expression level shown for each protein is the highest among the different studies summarized in Table 1; ATCD20 IgH and ATCD20 IgL indicate the heavy and light chains of the human anti-CD20 monoclonal antibody; FGA, FGB and FGG denote the alpha, beta and gamma chains of fibrinogen (FIB), respectively; mCol18a1 stands for mouse collagen, type XVIII, alpha 1; a star marks the expression level of a recombinant protein obtained with a cDNA based expression construct; an open circle marks the expression levels of hFVIII (0.0027 mg/mL), hIL-2 (0.000043 mg/mL) and mCol18a1 (0.00003 mg/mL), which are all too low to be visible as bars

Fig. 3

Dendrograms reflecting the codon usage of milk proteins of the five main livestock species (red) and of recombinant proteins. The top 13 highly expressed recombinant proteins (≥5 mg/mL) are shown in blue, and moderately and poorly expressed recombinant proteins are shown in black. The three classes of genes (indicated in three different colours) cannot be clearly discriminated from each other (cow, P = 0.992; goat, P = 0.982; sheep, P = 0.992; rabbit, P = 0.962; pig, P = 0.994)

Recombinant human proteins with greater expression levels in transgenic milk share similar protein domains with milk proteins

The main domains, derived from CATH, of the recombinant proteins expressed in the milk of transgenic animals and of mammalian milk proteins are summarized in Table 2. Caseins are the main component of milk proteins; in bovine milk, they account for as much as 82% of total milk protein (Jensen 1995). When we investigated the main domains of the four main caseins [casein alpha S1 (CSN1S1), casein alpha S2 (CSN1S2), casein beta (CSN2), and casein kappa (CSN3)], we observed that all four share a similar alpha–beta based major domain (Fig. 4). With the exception of CSN1S2, which is composed of an alpha–beta barrel domain, the caseins are all composed of a 3-layer (aba) sandwich shaped major domain. Interestingly, we found that among the top 13 highly expressed recombinant proteins in the milk of transgenic animals (expression level ≥5 mg/mL), 10 share a similar alpha–beta based major domain with the caseins. This is especially true of those with the highest expression levels, such as SERPINA1, SERPINC1, SERPING1, REG3A, LTF and BCHE, which all possess a 2-layer or 3-layer (aba) sandwich shaped major domain similar to that of the most abundant milk proteins CSN1S1, CSN2 and CSN3. Besides these 10 highly expressed proteins, two further proteins, CEL (1.00 mg/mL) and Calc1 (2.10 mg/mL), have a CSN2-like 3-layer (aba) sandwich domain and a mainly alpha–beta domain, respectively, although neither reaches a very high expression level in transgenic milk. Their moderate expression levels are probably due to their cDNA based expression constructs, because the gene structure used in an expression construct seems to have a significant impact on expression level: generally, a genomic DNA sequence results in an expression level several orders of magnitude greater than that of a cDNA sequence (Whitelaw et al. 1991).

Table 2 The main domains of recombinant proteins expressed in transgenic milk
Fig. 4

The tertiary structures of the main domains of casein proteins, of 10 recombinant human proteins highly expressed in the milk of transgenic animals, and of two moderately expressed proteins, CEL (1.00 mg/mL) and Calc1 (2.10 mg/mL), which were based on cDNA expression constructs. Figures were made using VMD (http://www.ks.uiuc.edu/Research/vmd/) and rendered using Snapshot

The promoter plays a critical role in the transcriptional regulation of transgene expression. Milk protein gene promoters must be used when expressing recombinant proteins in the animal mammary gland. Promoters derived from κ-casein and αS2-casein are particularly weak (Houdebine 2000). None of the studies reported in Table 1 used these two kinds of promoter. Promoters derived from bovine αS1-casein, goat β-casein, mouse murine acidic protein, and ovine β-lactoglobulin were the most commonly used in these studies. Under the regulation of the ovine β-lactoglobulin promoter and using genomic DNA of the foreign gene, transgenic sheep expressed high levels of SERPINA1 (35 mg/mL) and FIB (5 mg/mL), both of which possess major domains similar to those of caseins, but a lower level of FIX (1 mg/mL), which has a major domain distinct from those of caseins. Similarly, under the control of the goat β-casein promoter, transgenic goats expressed high levels of SERPINC1 (20 mg/mL) and SERPINA1 (14 mg/mL), but lower levels of AFP (1.1 mg/mL) and tPA (3 mg/mL); furthermore, under the control of the bovine αS1-casein promoter, transgenic rabbits expressed high levels of SERPING1 (12 mg/mL) and GAA (8 mg/mL), but lower levels of tPA (0.05 mg/mL), NGF (0.25 mg/mL) and IGF1 (1 mg/mL). These cases suggest that protein structure is an important factor affecting the expression level of a transgene in the mammary gland. Taken together, we suggest that recombinant proteins sharing similar major domains with casein proteins may have the potential to achieve a greater expression level in the milk of transgenic animals when they are under similar transcriptional regulation.

Materials and methods

Gene sequences analysis

Coding sequences of genes in this study were obtained from GenBank. On the basis of BioGPS (http://biogps.gnf.org/#goto=welcome) and several other extensive mRNA expression microarray studies (Warrington et al. 2000; Hsiao et al. 2001; Liang et al. 2006; Shyamsundar et al. 2005; Saito-Hisaminato et al. 2002), we have identified genes which are selectively expressed in human and mouse mammary gland at lactation stage and in six other tissues: human [mammary gland (13 genes), brain (25 genes), heart (22 genes), kidney (17 genes), liver (26 genes), lung (19 genes) and pancreas (27 genes; S1)] and mouse [mammary gland (15 genes), brain (24 genes), heart (22 genes), kidney (13 genes), liver (24 genes), lung (17 genes) and pancreas (21 genes; S2)].

The sequences of the animal milk protein genes and the recombinant proteins are summarized in supporting materials Table S3 and Table S4, respectively.

Codon usage analysis

For this study, the distance in synonymous codon usage between two genes was measured with a method based on the two-tailed Fisher exact test (Plotkin et al. 2004). Briefly, this measure is not concerned with the degree of codon bias in the usual sense, that is, the departure from random synonymous codon choice, but rather with the degree to which genes differ in their encoding of amino acids. Given the coding sequences for a pair of genes, the absolute frequency of each codon in each gene is first tabulated with CodonW (http://codonw.sourceforge.net//). For each amino acid, a two-tailed Fisher exact test is calculated on the n × 2 contingency table given by the frequencies of the amino acid's synonymous codons (e.g., for Ser n = 4: TCA, TCC, TCG, and TCT). As a result, for each amino acid a P value is obtained indicating whether or not the two genes use significantly different codons to encode that amino acid. The distance between two genes is given by the number of amino acids that exhibit significantly different (P < 0.01) codon usage, as defined above (the detailed SAS program is presented in supplementary materials S5). For example, when comparing human brain (25 genes) to heart (22 genes), the distance between the codon usage of every pair of genes (including pairs from the same tissue) is calculated, yielding a 47-by-47 symmetric matrix of pairwise distances. Then, using the neighbor joining method (PHYLIP v3.66; http://evolution.gs.washington.edu/phylip.html), a dendrogram is produced that graphically represents the measured pairwise distances between the codon usage of the study genes. To test whether the observed clustering of genes in a dendrogram is nonrandom, a P value is calculated by comparing the observed summed distances along the tree between genes of the same tissue against a null distribution produced by randomly permuting the labels of the leaves.
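The distance measure described above can be sketched in Python as follows. This is an illustrative reimplementation, not the SAS program of supplementary materials S5. Several choices are assumptions: codons are counted directly rather than with CodonW, all synonymous codons of an amino acid are tested as one family (so Ser has six rows here), the exact test enumerates every table with the observed margins, and the names `fisher_exact_nx2`, `codon_distance` and `distance_matrix` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product
from math import comb, prod

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

SYNONYMOUS = defaultdict(list)
for _codon, _aa in CODON_TO_AA.items():
    if _aa != "*":
        SYNONYMOUS[_aa].append(_codon)


def codon_counts(cds):
    """Absolute frequency of each in-frame codon."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))


def fisher_exact_nx2(table):
    """Two-tailed exact P value for an n x 2 table given as (a, b) rows."""
    rows = [(a, b) for a, b in table if a + b > 0]
    totals = [a + b for a, b in rows]
    col_a = sum(a for a, _ in rows)
    n = sum(totals)
    if len(rows) < 2 or col_a == 0 or col_a == n:
        return 1.0  # no information to compare

    # Unnormalised probability of the observed table (exact integers).
    observed = prod(comb(t, a) for (a, _), t in zip(rows, totals))

    def weights(i, remaining, w):
        # Enumerate every table with the same row and column totals.
        if i == len(totals) - 1:
            if remaining <= totals[i]:
                yield w * comb(totals[i], remaining)
            return
        for a in range(min(totals[i], remaining) + 1):
            yield from weights(i + 1, remaining - a, w * comb(totals[i], a))

    extreme = sum(w for w in weights(0, col_a, 1) if w <= observed)
    return extreme / comb(n, col_a)


def codon_distance(cds_a, cds_b, alpha=0.01):
    """Number of amino acids encoded with significantly different codons."""
    counts_a, counts_b = codon_counts(cds_a), codon_counts(cds_b)
    distance = 0
    for codons in SYNONYMOUS.values():
        if len(codons) < 2:
            continue  # Met and Trp have no synonymous codons
        table = [(counts_a[c], counts_b[c]) for c in codons]
        if fisher_exact_nx2(table) < alpha:
            distance += 1
    return distance


def distance_matrix(sequences):
    """Symmetric matrix of pairwise codon usage distances."""
    return [[codon_distance(a, b) for b in sequences] for a in sequences]
```

The full enumeration is simple but can become slow for six-fold degenerate amino acids in long genes; a production analysis would more likely rely on an established statistics package.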

Protein main domain analysis

Domains of the recombinant proteins expressed in the milk of transgenic animals and of mammalian milk proteins were evaluated with CATH (http://www.cathdb.info/), a hierarchical classification of protein domain structures that clusters proteins at four major levels: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H). Class is determined according to the secondary structure composition and packing within the structure. Three major classes are recognized: mainly-alpha, mainly-beta and alpha–beta. Architecture describes the overall shape of the domain structure as determined by the orientations of the secondary structures, but ignores the connectivity between them. It is currently assigned manually using a simple description of the secondary structure arrangement, e.g. barrel or 3-layer sandwich. At the Topology level, structures are grouped according to whether they share the same topology or fold in the core of the domain, that is, whether they share the same overall shape and connectivity of the secondary structures in the domain core. Homologous superfamily groups together protein domains that are thought to share a common ancestor and can therefore be described as homologous. Similarities are identified either by high sequence identity or by structure comparison using SSAP. Boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures, which include computational techniques, empirical and statistical evidence, literature review and expert analysis. Generally, if a given protein chain has sufficiently high sequence identity and structural similarity (i.e. 80% sequence identity, SSAP score ≥80) with a chain that has previously been chopped, the domain boundary assignment is performed automatically by inheriting the boundaries from the other chain (ChopClose). Otherwise, the domain boundaries are assigned manually, based on an analysis of results derived from a range of algorithms, which include structure based methods [CATHEDRAL, SSAP, DETECTIVE (Swindells, 1995), PUU (Holm and Sander, 1994), DOMAK (Siddiqui and Barton, 1995)], sequence based methods (Profile HMMs) and relevant literature.
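The four-level hierarchy can be illustrated with a small Python sketch that parses a dotted CATH identifier and reports the deepest level at which two domains agree. The type `CathCode`, the helper names, and the example identifiers in the usage note are assumptions for illustration only; they are not the assignments made for the proteins in Table 2.

```python
from typing import NamedTuple, Optional

LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")

# The three major classes named in the text, keyed by CATH class number.
CATH_CLASSES = {1: "mainly-alpha", 2: "mainly-beta", 3: "alpha-beta"}


class CathCode(NamedTuple):
    cls: int
    architecture: int
    topology: int
    superfamily: int


def parse_cath_code(code: str) -> CathCode:
    """Parse a dotted identifier of the form C.A.T.H into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def deepest_shared_level(a: CathCode, b: CathCode) -> Optional[str]:
    """Name of the deepest level at which two domains are classified together."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return LEVELS[depth - 1] if depth else None
```

For instance, `deepest_shared_level(parse_cath_code("3.40.50.300"), parse_cath_code("3.40.50.720"))` evaluates to `"Topology"`: the two illustrative domains share class, architecture and fold but belong to different homologous superfamilies.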

How to improve mammary transgene expression?

We suspected that codon usage optimization is not necessary when targeting a human gene for expression in the animal mammary gland, so we investigated codon usage patterns in the mammary gland and other tissues. Significant differences were found between the mammary gland and both heart and pancreas in human and mouse. However, no significant correlation was found between the expression levels and the codon usage of recombinant human proteins expressed in the milk of transgenic animals. This may indicate that sharing a similar codon usage with mammary gland specific genes, especially milk protein genes, does not guarantee that a human gene will be effectively expressed in the mammary gland. In contrast, proteins that share a similar domain with the four caseins are capable of achieving higher expression levels in the animal mammary gland. We suppose that a recombinant protein whose tertiary structure is adapted to the synthesis and secretion process of milk proteins, especially caseins, may be efficiently expressed in the animal mammary gland.

In the secretory epithelial cells of the mammary gland, milk protein precursors are assembled on the ribosomes of the highly developed rough endoplasmic reticulum (ER). All milk proteins have conserved secretory signal peptide sequences, which direct the growing nascent peptides into the lumen of the ER. The proteins are then transported to the Golgi apparatus. The caseins then gradually associate with each other and with calcium and phosphate to form submicellar structures, which lead to the formation of casein micelles that are finally secreted by reverse pinocytosis (Farrell et al. 2006). Protein folding in the ER is monitored by ER quality control (ERQC) mechanisms. Proteins that pass ERQC criteria traffic to their final destinations through the secretory pathway, whereas non-native proteins and unassembled subunits of multimeric proteins are degraded by the ER-associated degradation (ERAD) pathway (Vembar and Brodsky 2008). Overall, the environment of the ER lumen is likely to be conducive to proper casein-casein association, which helps caseins to escape ERAD and move on to the Golgi for processing and secretion. Thus recombinant proteins such as SERPINA1, SERPINC1, SERPING1 and BCHE, which have compact spherical structures composed of mainly alpha–beta, 2-layer or 3-layer sandwich-like domains, presumably self-associate or interact with caseins through the conserved structural elements, escape ERAD, are efficiently transported to the Golgi apparatus to assemble into micelles, and are ultimately secreted efficiently into milk, leading to a high expression level in transgenic milk. In contrast, AFP, CAT, FIX, IL2, NGF, and tPA all have a mainly alpha or beta barrel based domain, which is quite different from that of casein (Supplementary Fig. 10). They may not interact with caseins properly and may aggregate and trigger ERAD, thus failing to traffic to the Golgi apparatus, which ultimately leads to less efficient expression in the milk of transgenic animals.

Today it is still difficult to predict exactly the expression efficiency of a transgenic animal bioreactor, because the expression efficiency of a mammary transgene relies on a series of complex molecular regulatory and cellular processes, and it may also depend on the animal species. Protein tertiary structure may be an important factor affecting the expression of mammary transgenes. We doubt that it is an absolute factor, but for proteins that fail to be secreted into transgenic milk even after transcriptional optimization, for example, we might need to pay more attention to their tertiary structures. To verify this hypothesis, we may need to generate a group of carefully designed gene targeted mice expressing structurally distinct proteins under similar transcriptional regulation. Furthermore, for mammary transgenes encoding proteins that are structurally distinct from caseins, and that may therefore pass through the ER/Golgi-dependent classical secretion pathway inefficiently, engineering the non-classical secretory pathway (He et al. 2008a, b; Nickel 2003) is an intriguing option.