Main

An open question regarding horizontal gene transfer is how many genes or what proportion of genes in microbes have extrinsic origins. There are two controversial perspectives on this issue4. One argues that horizontal transfer is limited to a certain kind of gene, compared with vertical transmission5, whereas the other claims that all genes have undergone horizontal transfer6. It is now thought that prokaryotic genomes are composed of two functionally distinct types of genes7: (i) less transferable 'informational' genes involved in information processing in the cell, such as translation, transcription and replication; and (ii) frequently transferable 'operational' genes involved in metabolism and considered to have fewer functional constraints. It is possible that any gene, even rRNA genes8, can be transferred.

We applied the Bayesian method to analyze 116 prokaryotic complete genomes and found that 46,759 (14%) of the total 324,653 open reading frames (ORFs) were derived from recent horizontal transfers (Table 1). The average proportion of horizontally transferred genes per genome was 12% of all ORFs, ranging from 0.5% to 25% depending on prokaryotic lineage. The smallest proportion (0.5%) was observed in the endocellular symbiont Buchnera sp. APS, and other symbiotic or parasitic bacteria, such as Wigglesworthia brevipalpis, Chlamydia, Mycoplasma, Rickettsia and Borrelia burgdorferi, showed small proportions. The largest proportion (25.2%) was observed in the euryarchaeal Methanosarcina acetivorans. The differences in the proportions are possibly due to the evolutionary processes of these species and are consistent with previous studies (Supplementary Note online). In general, we found a positive correlation between the total number of ORFs in a genome and the proportion of horizontally transferred genes (Supplementary Fig. 1 online). But our method may be preferentially detecting recent transfer and missing ancient transfer. Therefore, the frequency of horizontal transfer may be underestimated, and more transfer events might have actually occurred. We evaluated the effectiveness of the Bayesian method by comparing our estimate with those from the reference method (Supplementary Table 1 and Supplementary Note online).

Table 1 Proportion of horizontally transferred genes in complete genomes

Although a single gene might have a low horizontal transfer index (HTI) purely by chance, it is unlikely that a large cluster of neighboring genes would all have low HTIs by chance. Therefore, such clusters are considered to be a single unit simultaneously inserted into the genome. In particular, it has been suggested that a number of pathogenicity genes were horizontally transferred as large clusters, called 'pathogenicity islands'9. To look for such clusters, we computed local densities of horizontally transferred genes using a simple window analysis of HTIs in the genome. We found a total of 1,357 possible clusters in the 116 genomes (Table 1). These corresponded to regions previously known as pathogenicity islands or to regions where sequence similarities were suggestive of virulence-related functions. These latter regions may be new pathogenicity islands. We detected 83 potential pathogenicity islands in 24 plant and animal pathogens (Supplementary Table 2 online).

An advantage of our method is that it can identify the donor species. Although phylogenetic tree reconstruction is generally believed to be a better method for donor identification, most horizontally transferred genes have few significant matches in databases (Supplementary Note online). Our method can be used as a complementary approach to the phylogenetic analysis. We defined the horizontal transfer donor index (HTDI) and predicted the donor species of the horizontally transferred genes. In the genome of Neisseria meningitidis strain MC58 (ref. 10), for example, the gene NMB0066 that encodes the rRNA adenine N-6-methyltransferase was previously suggested to be horizontally transferred10, and phylogenetic analysis showed that this gene originated from Staphylococcus plasmids (Fig. 1a). Our donor identification method using the models of both N. meningitidis and Staphylococcus aureus also indicated that Staphylococcus was a possible origin of NMB0066, showing the effectiveness of this method (Fig. 1b). We could not apply the phylogenetic analysis to the neighboring genes that were transferred simultaneously with NMB0066 because the databases lacked appropriate homologs. But the HTDIs of these neighboring genes also supported Staphylococcus as the origin (Fig. 1b). N. meningitidis has a highly variable genome, and frequent horizontal transfer between the Neisseria and Haemophilus genera has been suggested11,12. Here, we identified horizontally transferred genes of N. meningitidis originating from a Streptococcus lineage, as well as from Staphylococcus and Haemophilus origins (Fig. 1c,d). These donor-recipient relationships were also independently supported by the phylogenetic analysis that we carried out (data not shown). The results suggest that the N. meningitidis genome has a mosaic structure composed of genes derived from multiple origins.

Figure 1: Donor identification of horizontally transferred genes in Neisseria meningitidis.

(a) Molecular phylogenetic tree of the N. meningitidis MC58 gene NMB0066. (b–d) HTIs and HTDIs of NMB0066, NMB1268, NMB1979 and their 15 surrounding genes. Black lines represent the HTIs calculated using the N. meningitidis MC58 model itself. Colored lines represent the HTDIs obtained using the N. meningitidis and donor candidate (S. aureus (b), S. pneumoniae (c) and H. influenzae (d)) models.

A vehicle is needed to transfer genes efficiently between different species. It is thought that foreign DNAs are mainly transferred by means of plasmids or bacteriophages, as well as direct uptake by the host itself1,2,13. Hence, the Bayesian method may also detect the plasmid or bacteriophage origin of horizontally transferred genes in the host species. Therefore, we split the host genome sequences into two independent regions, horizontally transferred and nontransferred regions, according to our HTI results (Table 1), and constructed two separate training models (the HT and non-HT models). We then computed and compared the HTIs of genes encoded in plasmid and bacteriophage genomes using both models (Table 2). For most species, the HT model predicted the plasmid or phage genes more effectively than the non-HT model. These observations imply that in many cases, the horizontally transferred genes were initially inserted into plasmids or phages and then introgressed into the recipient species. For Borrelia burgdorferi plasmids, however, all indices were higher with the non-HT model than the HT model, implying that the genes have been settled in B. burgdorferi for a long time and that their nucleotide compositions became similar to those of the host chromosomes by amelioration14.

Table 2 HTIs of plasmid or bacteriophage genes obtained using the HT and non-HT models

We examined the proportion of horizontally transferred genes in different functional categories based on the definitions produced by The Institute of Genomic Research (TIGR)15. Four main functional categories had high proportions of horizontally transferred genes: 'plasmid, phage and transposon functions' (28.3%), 'cell envelope' (13.8%), 'regulatory functions' (11.0%) and 'cellular processes' (10.0%; Fig. 2a). We surveyed the categories with the highest and second highest fractions of horizontally transferred genes in each genome and found that these four categories were the ones mainly represented in individual species (Table 1).

Figure 2: Proportion of horizontally transferred (HT) genes in each functional category.

Roles in which the proportion of horizontally transferred genes was larger than 10% are filled, and rare roles (<1,000 genes for the main roles, and <200 genes for the subroles) were excluded. (a) 'Plasmid, phage, transposon functions', which is one of the main role categories, was originally two different roles, 'viral functions' and 'other category', in the TIGR database. Here these roles are united because 'other category' is composed of three subrole categories: 'plasmid functions', 'prophage functions' and 'transposon functions'. Likewise, the three main roles in the TIGR database, 'hypothetical proteins', 'unclassified' and 'unknown function' were united as 'unknown proteins'. (b) Proportion of horizontally transferred genes in each subrole of the three main roles, namely 'cell envelope', 'regulatory functions' and 'cellular processes'.

We then examined the subroles of three categories (cell envelope, regulatory functions and cellular processes; Fig. 2b). We omitted the plasmid-phage-transposon category, which had the highest proportion, because it contains genes related to mobile elements that can be transferred naturally between different species. Many genes belonging to the 'cell envelope' category were classified under 'surface structure' (namely fimbrial or pilus protein genes) or 'biosynthesis and degradation of surface polysaccharides and lipopolysaccharides'. Of the 'cellular processes' genes, pathogenicity-related genes (pathogenesis or toxin production or resistance), including genes responsible for antibiotic synthesis, had been subjected to frequent horizontal transfer, although 'DNA transformation' had the highest proportion of horizontally transferred genes in this category. Cell surface genes may also be involved in the pathogenicity-related functions, because cell surface genes might have contributed to defense against immunological responses from infected hosts16, and some horizontally transferred genes in the 'surface structures' subgroup related to pilus structure might be involved in virulence, as they enable microbes to attach to the host cells17. These pathogenicity-related genes comprised 19% of the horizontally transferred genes examined (Supplementary Table 3 online). The number of horizontally transferred genes in this group is significantly larger than in other groups (e.g., the subrole 'pathogenesis': P < 10^−100 using the χ² test), quantitatively indicating more frequent exchange among species of these genes than of others.

Many genes from the 'regulatory functions' category were involved in 'DNA interactions' and encode DNA binding proteins. Because these genes can promote or inhibit transcription, their acquisition through horizontal transfer might have altered the gene expression patterns in the recipient organism. An abundance of horizontally transferred genes having 'regulatory functions' was mainly observed in soil bacteria or gram-positive bacteria with low G+C content (Table 1 and Supplementary Table 4 online), and appears to be a characteristic evolutionary feature of these genomes.

The category with the fifth highest proportion of horizontally transferred genes was 'DNA metabolism' (Fig. 2a), and this abundance was mainly due to the fraction in the subrole 'restriction/modification' (Supplementary Table 3 online). Genes of 'protein synthesis' (2.7%), the 'purines, pyrimidines, nucleosides and nucleotides' (2.0%) and 'amino acid biosynthesis' (1.7%), which have a pivotal role in information processing in the cell, had the lowest frequencies of horizontal transfer (Fig. 2a).

Because cell surface, DNA binding and pathogenicity-related genes are included among the operational genes, the abundance of horizontally transferred genes in these categories is consistent with previous reports7. But other operational genes, related to amino acid biosynthesis, biosynthesis of cofactors, energy metabolism, intermediary metabolism, fatty acid and phospholipid metabolism, and nucleotide biosynthesis, had low proportions of horizontally transferred genes, comparable to the proportions of informational genes involved in transcription and protein synthesis (Fig. 2a). This suggests that operational genes, previously considered generally transferable, should be further classified into two groups according to their transferability. The highly transferable genes (cell surface, DNA binding, pathogenicity-related genes) seem to have been fixed by natural selection after horizontal transfer according to their advantageous characteristics for survival in a variety of environments or in host organisms.

Our observations are limited to recent horizontal transfer events because of the sensitivity of our method. If we extend the time scale of the events, we may have to conclude that all genes have undergone horizontal transfer at least once. The approach described here will therefore provide the initial basis for quantitatively understanding the evolution of the prokaryotic genome from the viewpoint of horizontal gene transfer.

Methods

Complete genome sequences.

We retrieved the complete sequences of 116 prokaryote genomes, 363 plasmids and 149 bacteriophages from the DNA Data Bank of Japan, EMBL and GenBank databases as of 1 April 2003.

Detection algorithm (HTI).

We computed the posterior probability to distinguish between intrinsic and extrinsic genes. The posterior probability is the probability that a DNA fragment in a given window is a coding region. To calculate this probability, we computed nucleotide compositions of the coding and noncoding regions in a genome. Thus, an extrinsic DNA segment introgressed into the recipient genome should, ideally, be distinguishable from the recipient genome sequences by its nucleotide composition, unless the donor and recipient species are close relatives with similar nucleotide compositions. Anciently transferred genes may be indistinguishable, because the nucleotide composition of horizontally transferred genes is ameliorated and converges toward that of the recipient genome under mutation pressure14. Thus, our method may preferentially detect recent horizontally transferred genes for which the amelioration process has not yet been completed.

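The posterior probability that a nucleotide fragment F appears in the coding regions of the genome is given by Bayes' theorem as follows18:

$$P(\mathrm{COD}_m \mid F) = \frac{P(F \mid \mathrm{COD}_m)\,P(\mathrm{COD}_m)}{\sum_{k=1}^{6} P(F \mid \mathrm{COD}_k)\,P(\mathrm{COD}_k) + P(F \mid \mathrm{NON})\,P(\mathrm{NON})}$$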

Here, P(CODm|F) is the posterior probability that F is the coding sequence of the mth reading frame, where m = 1 corresponds to the true frame. The prior probabilities, P(CODm) and P(NON), are assumed to be 1/12 and 1/2, respectively. This algorithm was originally developed for gene finding18. The conditional probabilities, P(F|CODm) and P(F|NON), are calculated from the Markov chain models of both coding and noncoding regions obtained from the entire genome. The data set of the Markov chain models is called the training model. We primarily extracted coding and noncoding sequences from the complete genome sequence according to the database annotations. tRNA and rRNA genes and annotated pseudogenes were excluded from this analysis. The parameters of the training models (the initiation and transition probabilities that make up the Markov chains) were estimated by computing nucleotide frequencies in coding regions (CODm) or noncoding regions (NON). That is, initiation probabilities are identical to the frequencies of x-bp tuples, and transition probabilities are identical to the conditional probabilities that, given an x-bp tuple, a given base appears at the next position. The order of the Markov chains was set to five (x = 5) to avoid overfitting of the parameters19.
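
The parameter estimation and the Bayes step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the pseudocount and the use of a single homogeneous chain (with no reading-frame phase) are assumptions made for illustration. The six frame likelihoods passed to `posterior_frames` would in practice come from frame-specific coding models.

```python
import math
from collections import Counter

BASES = "ACGT"

def train_markov(seqs, order=5, pseudo=1.0):
    """Count x-bp tuples (initiation) and tuple-to-base transitions.

    The pseudocount is an assumption of this sketch; it only prevents
    zero probabilities for tuples that never occur in the training set.
    """
    init, ctx, trans = Counter(), Counter(), Counter()
    for s in seqs:
        for i in range(len(s) - order + 1):
            init[s[i:i + order]] += 1              # every x-bp tuple
        for i in range(len(s) - order):
            ctx[s[i:i + order]] += 1               # tuple followed by a base
            trans[s[i:i + order], s[i + order]] += 1
    return {"order": order, "pseudo": pseudo, "init": init,
            "n_init": sum(init.values()), "ctx": ctx, "trans": trans}

def init_prob(model, tup):
    """Initiation probability: smoothed frequency of an x-bp tuple."""
    k = len(BASES) ** model["order"]
    return (model["init"][tup] + model["pseudo"]) / (model["n_init"] + model["pseudo"] * k)

def trans_prob(model, tup, base):
    """Transition probability: P(next base | preceding x-bp tuple)."""
    return ((model["trans"][tup, base] + model["pseudo"])
            / (model["ctx"][tup] + model["pseudo"] * len(BASES)))

def log_likelihood(model, frag):
    """log P(F | model) for a fragment at least `order` bases long."""
    x = model["order"]
    ll = math.log(init_prob(model, frag[:x]))
    for i in range(x, len(frag)):
        ll += math.log(trans_prob(model, frag[i - x:i], frag[i]))
    return ll

def posterior_frames(log_lik_cod, log_lik_non, prior_cod=1 / 12, prior_non=1 / 2):
    """Bayes' theorem over six coding frames plus the noncoding state.

    Returns the six posteriors P(COD_m | F); index 0 is m = 1.
    """
    terms = [ll + math.log(prior_cod) for ll in log_lik_cod]
    terms.append(log_lik_non + math.log(prior_non))
    top = max(terms)                               # subtract max to avoid underflow
    weights = [math.exp(t - top) for t in terms]
    total = sum(weights)
    return [w / total for w in weights[:len(log_lik_cod)]]
```

Working in log space keeps the products of many small transition probabilities from underflowing, which matters for 96-bp windows under a fifth-order chain.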

Finally, for each gene in the genome, we computed an index defined as the average P(COD1|F) value using window analysis (here F is a window sequence of the query gene) and named it the HTI of the gene. The window was 96 bp wide and was slid along the gene sequence in steps of 12 bp. In general, the training model contains parameters derived from the query gene itself, which may inflate the HTI of that gene. Therefore, to cancel this self-contribution, the gene's parameters were subtracted from those of the training model during the computation.
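
The window analysis can be sketched as follows. This is an illustrative outline only: `posterior_cod1` is a hypothetical callable standing in for the P(COD1|F) computation, the handling of genes shorter than one window is an assumption, and the subtraction of the query gene's own counts from the training model is not shown.

```python
def sliding_windows(seq, size=96, step=12):
    """Return all full-length windows of `size` bp taken every `step` bp."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]

def hti(gene_seq, posterior_cod1, size=96, step=12):
    """Average P(COD1 | F) over the windows of one gene.

    posterior_cod1: hypothetical function mapping a window sequence to
    its posterior probability of being coding in the true frame.
    """
    windows = sliding_windows(gene_seq, size, step)
    if not windows:
        # Assumption of this sketch: a gene shorter than one window
        # is scored as a single fragment.
        windows = [gene_seq]
    return sum(posterior_cod1(w) for w in windows) / len(windows)

# Toy stand-in for the posterior, used only to exercise the function.
def gc_fraction(window):
    return sum(base in "GC" for base in window) / len(window)

example_hti = hti("GC" * 60, gc_fraction)
```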

Statistical significance of the HTI.

Because previous studies questioned the accuracy of ab initio methods mainly due to the ambiguity of statistical significance20,21, we conducted a statistical test of the horizontal transfer detection method using Monte Carlo simulation. We randomly generated 100 artificial coding fragments for each gene based on the nucleotide frequencies of P(F|COD1). Thus, when the total number of genes in a given genome is T, the HTIs of 100 × T artificial fragments are computed and their distribution is obtained. The length of each fragment corresponds to that of a real gene. The significance for horizontal transfer was examined using a one-tailed test at a significance level of 1%.
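
A minimal sketch of such a simulation is given below, assuming a chain of order one or higher supplied as plain dictionaries. The data layout and the function names are hypothetical, and the empirical P value is simply the fraction of simulated HTIs at or below the observed one.

```python
import random

def sample_fragment(init, trans, length, rng):
    """Draw one artificial fragment from a Markov chain.

    init:  dict mapping each x-bp tuple to its initiation probability
    trans: dict mapping each x-bp tuple to {base: transition probability}
    """
    tuples = list(init)
    seq = rng.choices(tuples, weights=[init[t] for t in tuples])[0]
    order = len(seq)
    while len(seq) < length:
        probs = trans[seq[-order:]]
        bases = list(probs)
        seq += rng.choices(bases, weights=[probs[b] for b in bases])[0]
    return seq[:length]

def lower_tail_p(observed, null_values):
    """One-tailed empirical P value: share of null values <= observed."""
    return sum(v <= observed for v in null_values) / len(null_values)

def null_htis(gene_lengths, init, trans, score, n_per_gene=100, seed=0):
    """HTIs of n_per_gene artificial fragments per gene (100 x T in total)."""
    rng = random.Random(seed)
    return [score(sample_fragment(init, trans, n, rng))
            for n in gene_lengths for _ in range(n_per_gene)]
```

A gene would then be called significant at the 1% level when `lower_tail_p(observed_hti, null)` is below 0.01, where `score` is whatever function computes the HTI of a fragment.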

Correction of the horizontally transferred gene list using the model derived from highly expressed genes.

Statistical significance alone does not guarantee the precise detection of horizontal transfer events, because functional constraints can also cause nucleotide biases, as seen in highly expressed genes. For example, because ribosomal protein genes often have anomalous base compositions or codon usage biases to maintain high translation efficiency22,23, these genes might be detected as false positives. Therefore, we prepared a referential model to exclude these highly expressed genes. The model was constructed using the coding and noncoding sequences of ribosomal protein gene regions. The order of the Markov chains was three because of the limited number of sequences (50–60 ribosomal protein genes in a genome). The Monte Carlo simulation described above was used as the statistical test.

Finally, genes that satisfied the following two criteria were regarded as horizontally transferred genes: (i) genes that have significantly low HTIs with the model of the species (P < 0.01) and (ii) genes that do not have significantly high HTIs with the referential model of the ribosomal protein genes (P < 0.05). The data set of the horizontally transferred genes detected in this study is available in the database of horizontal gene transfer (see URLs).
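
The two criteria can be written as a short predicate. The sketch assumes that the two P values have already been obtained from the simulations; the function and argument names are hypothetical.

```python
def is_horizontally_transferred(p_low_species, p_high_ribosomal):
    """Apply the two selection criteria to one gene.

    p_low_species:    P value for a low HTI under the species model
    p_high_ribosomal: P value for a high HTI under the ribosomal
                      protein gene (referential) model
    """
    low_under_species = p_low_species < 0.01         # criterion (i)
    high_under_ribosomal = p_high_ribosomal < 0.05   # excluded by criterion (ii)
    return low_under_species and not high_under_ribosomal
```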

Identification of the donor species of horizontally transferred genes.

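When a donor candidate was suggested by other information, such as a phylogenetic tree, the probability that the gene was derived from this donor was estimated as follows:

$$P(\mathrm{COD}_{d1} \mid F) = \frac{P(F \mid \mathrm{COD}_{d1})\,P(\mathrm{COD}_{d1})}{P(F \mid \mathrm{COD}_{r1})\,P(\mathrm{COD}_{r1}) + P(F \mid \mathrm{COD}_{d1})\,P(\mathrm{COD}_{d1})}$$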

Here, CODr1 is the true reading frame of a recipient species, and CODd1 is the true reading frame of a donor species. In the equation, we compared the probabilities P(F|COD1) between the recipient and donor models and assumed that P(CODr1) = P(CODd1) = 1/2. We defined the HTDI as the average P(CODd1|F) using window analysis, in a similar way to the HTI.
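
As an illustration, the donor posterior for one window can be computed from the two log-likelihoods as in the sketch below. The function name is hypothetical and the log-likelihoods are assumed to come from the recipient and donor coding models. The HTDI of a gene would then be the average of this value over its windows.

```python
import math

def donor_posterior(log_lik_recipient, log_lik_donor,
                    prior_recipient=0.5, prior_donor=0.5):
    """P(COD_d1 | F) from log P(F | COD_r1) and log P(F | COD_d1)."""
    a = log_lik_recipient + math.log(prior_recipient)
    b = log_lik_donor + math.log(prior_donor)
    top = max(a, b)                      # subtract max to avoid underflow
    weight_r = math.exp(a - top)
    weight_d = math.exp(b - top)
    return weight_d / (weight_r + weight_d)
```

With equal priors the result depends only on the likelihood ratio, so a window that is three times more likely under the donor model receives a posterior of 0.75.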

Detection of horizontally transferred gene clusters.

Extrinsic gene regions such as pathogenicity islands are often inserted as large clusters into the genome9,24. To detect these clusters, we calculated the number of horizontally transferred gene candidates in a window of ten genes slid by one gene over the genome and identified the regions in which the proportion of horizontally transferred gene candidates was greater than 40%. Both ends of the clusters were manually corrected, and then several clusters were joined, particularly when they seemed to be consecutively located in the genome.
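
The automatic part of this procedure can be sketched as below. Merging overlapping or adjacent windows into a single region is an assumption of the sketch; the manual correction of cluster ends and the joining of neighboring clusters are not reproduced.

```python
def find_clusters(is_ht, window=10, threshold=0.4):
    """Find regions dense in horizontally transferred gene candidates.

    is_ht: list of 0/1 flags, one per gene in genome order.
    Returns inclusive (start, end) gene-index ranges built from windows
    whose proportion of candidates exceeds `threshold`.
    """
    regions = []
    for start in range(len(is_ht) - window + 1):
        fraction = sum(is_ht[start:start + window]) / window
        if fraction > threshold:
            end = start + window - 1
            if regions and start <= regions[-1][1] + 1:
                regions[-1] = (regions[-1][0], end)   # extend previous region
            else:
                regions.append((start, end))
    return regions

# Toy genome: 30 genes with six consecutive candidates at indices 10-15.
flags = [0] * 10 + [1] * 6 + [0] * 14
clusters = find_clusters(flags)
```

Because a window is flagged as soon as more than four of its ten genes are candidates, the raw regions extend beyond the candidate genes themselves, which is one reason the ends would need correction afterwards.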

Functional annotation of the horizontally transferred genes.

Using the horizontal gene transfer data sets that we obtained, we assigned biological roles to the horizontally transferred gene candidates. We used 90 genomes whose gene functions have been classified by the Comprehensive Microbial Resource in TIGR15. Of the 33,177 horizontally transferred genes obtained from these 90 genomes, 28,278 were categorized into higher orders of 'main role' and lower orders of 'subrole' according to the TIGR annotations.

Phylogenetic analysis.

We used the FASTA program to search the DNA Data Bank of Japan DAD 21, Swiss-Prot 40 and PIR 72 protein databases for homologs of the horizontally transferred genes (E < 10^−8; ref. 25), aligned homologous sequences using CLUSTAL W26 and reconstructed phylogenetic trees using the neighbor-joining method with Kimura's distance correction27, excluding gapped positions.

URLs.

Genome sequences came from the DNA Data Bank of Japan, EMBL and GenBank databases, available at http://gib.genes.nig.ac.jp/, http://www.ebi.ac.uk/genomes/ and http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome, respectively. The horizontal gene transfer data sets are available at http://poplar.genes.nig.ac.jp/~hgt/viewer_top.cgi.

Note: Supplementary information is available on the Nature Genetics website.