Main

P. aeruginosa is a versatile Gram-negative bacterium that grows in soil, marshes and coastal marine habitats, as well as on plant and animal tissues1. It forms biofilms on wet surfaces such as those of rocks and soil2,3. The emergence of P. aeruginosa as a major opportunistic human pathogen during the past century may be a consequence of its resistance to the antibiotics and disinfectants that eliminate other environmental bacteria. P. aeruginosa is now a significant source of bacteraemia in burn victims, urinary-tract infections in catheterized patients, and hospital-acquired pneumonia in patients on respirators4. It is also the predominant cause of morbidity and mortality in cystic fibrosis patients, whose abnormal airway epithelia allow long-term colonization of the lungs by P. aeruginosa. These infections are impossible to eradicate, in part because of the natural resistance of the bacterium to antibiotics, and ultimately lead to pulmonary failure and death.

Here we report the sequencing of the genome of P. aeruginosa. The sequence is of interest because of the insights it provides into the role of this bacterium as a pathogen, and because it offers new information on the relationship between genome size, genetic complexity and ecological versatility in bacteria. At 6.3 million base pairs (Mbp), the P. aeruginosa genome is markedly larger than most of the 25 sequenced bacterial genomes. In fact, with 5,570 predicted open reading frames (ORFs), the genetic complexity of P. aeruginosa approaches that of the simple eukaryote Saccharomyces cerevisiae, whose genome encodes about 6,200 proteins5. In contrast, P. aeruginosa has only 30–40% of the number of predicted genes present in the simple metazoans Caenorhabditis elegans and Drosophila melanogaster6.

Sequencing and assembly

Sequencing of the complete 6.3-Mbp genome of P. aeruginosa was accomplished by a straightforward implementation of whole-genome-shotgun sampling. Previously, the largest genome completely sequenced by this approach was that of Deinococcus radiodurans (2.6 Mbp)7. The other large bacterial genome sequences (Bacillus subtilis, 4.2 Mbp; Synechocystis, 3.6 Mbp; Escherichia coli, 4.6 Mbp; and Mycobacterium tuberculosis, 4.4 Mbp)8,9,10,11 were all initially determined by sequencing overlapping sets of clones, polymerase chain reaction (PCR) products and gel-purified restriction fragments.

With one major exception, the assembled genome sequence is in excellent agreement with the physical map of the P. aeruginosa genome12,13. The exception is the inversion of more than one-quarter of the genome in the PAO1 isolate we sequenced, relative to DSM-1707, the PAO1-derived isolate previously mapped in the laboratory of B. Tümmler12,13 (Fig. 1). As both of these isolates are clonally derived from the original PAO1 strain of P. aeruginosa, any differences in genome structure between PAO1 and DSM-1707 must have arisen during propagation. This inversion does not appear to be unique to the sequenced isolate as PAO1 stocks from other laboratories have the same inversion (H. Schweizer, personal communication). The inversion appears to have resulted from homologous recombination between the rrnA and rrnB loci, which are orientated in opposite directions and separated by 1.7 Mbp (Fig. 1). Comparative analysis of digests of PAO1 and DSM-1707 with SfoI, SwaI and PacI supported this possibility. Earlier observations of inversions of genomic segments between oppositely orientated ribosomal DNA loci in E. coli and S. typhimurium led to the proposal that these reversible genome rearrangements may have adaptive significance14.

Figure 1: Circular representation of the P. aeruginosa genome.

The outermost circle indicates the chromosomal location in base pairs (each tick is 100 kb). The distribution of genes is depicted by coloured boxes according to functional category and direction of transcription (outer band is the plus strand; inner band is the minus strand). Red arrows, the locations and direction of transcription of ribosomal RNA genes; green arrow, the inverted region that resulted from a homologous recombination event between rrnA and rrnB; blue arrows, location of two regions containing probable bacteriophages. The black plot in the centre is percentage G+C content plotted as the average for non-overlapping 1-kb windows spanning one strand for the entire P. aeruginosa genome. Yellow bars, regions of ≥ 3.0 kb with G+C content of two standard deviations (< 58.8%) below the mean (66.6%) (see Supplementary Information). A linear map of the genes, with the colour code for functional categories, is available at http://www.pseudomonas.com.

Properties of the genome and relationship to other bacteria

Basic features.

These are summarized in Table 1 and Fig. 1. Most of the predicted ORFs have the high G+C content (66.6%) characteristic of the genome as a whole and have codon usage similar to previously described P. aeruginosa genes. However, ten regions of 3.0 kilobases (kb) or greater exhibit significantly lower G+C content and unusual codon usage (Fig. 1), possibly indicative of recent horizontal transfer. In addition, there are two regions (PA0616–PA0648, PA0715–PA0728) containing probable bacteriophages.
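The low-G+C screen summarized in the legend to Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce the figure: the function names, the use of the population standard deviation, and the rule that every window in a run must fall below the cutoff are assumptions made for the example. For reference, a cutoff of 58.8% against a mean of 66.6% implies a standard deviation of about 3.9 percentage points.

```python
from statistics import mean, pstdev


def gc_windows(seq, window=1000):
    """Return the G+C percentage of each complete, non-overlapping window."""
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        values.append(100.0 * gc / window)
    return values


def low_gc_regions(values, window=1000, n_sd=2.0, min_len=3000):
    """Find runs of consecutive windows below (mean - n_sd * SD).

    Only runs spanning at least min_len bases are reported, as half-open
    (start, end) base coordinates. Assumption: each window in the run must
    itself fall below the cutoff.
    """
    cutoff = mean(values) - n_sd * pstdev(values)
    regions = []
    run_start = None
    # A sentinel value above any cutoff closes a run that reaches the end.
    for i, value in enumerate(values + [float("inf")]):
        if value < cutoff:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if (i - run_start) * window >= min_len:
                regions.append((run_start * window, i * window))
            run_start = None
    return regions
```

With `window=1000` and `min_len=3000` the thresholds match those quoted in the legend; a real analysis would also need to treat the circular chromosome and ambiguous bases, which this sketch ignores.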

Table 1 Genome features

Comparative analysis.

To gain insight into the significance of the size of this genome, we concentrated on comparative analysis of the P. aeruginosa and E. coli genomes. Not only is E. coli the most intensively studied of all bacteria, but it is also the closest relative of P. aeruginosa among the bacteria with fully sequenced genomes: for example, when each of the 5,570 predicted ORFs for P. aeruginosa was compared with the pooled ORFs for 22 other bacterial genomes by the sequence-alignment algorithm BLASTP44, nearly half of the best hits above a stringent comparison threshold (an expect value of 10⁻⁵) were to E. coli, and no other organism accounts for even 10% (see Supplementary Information). The relative prominence of E. coli increases moderately at more stringent thresholds although a noisy pattern of weak ‘best hits’ to phylogenetically distant bacteria emerges at lower thresholds. Although this test confirms that E. coli is a sensible comparison partner for P. aeruginosa, the median amino-acid identity within the aligned region of the P. aeruginosa–E. coli orthologues is only 40%.
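The best-hit tally described above can be illustrated with a short Python sketch. It assumes the BLASTP results have already been reduced to (query ORF, subject organism, expect value) triples; that input layout, the function name and the toy ORF labels in the usage note are hypothetical and are not taken from the analysis itself.

```python
from collections import Counter


def best_hit_shares(hits, max_evalue=1e-5):
    """Return the fraction of best hits attributed to each organism.

    hits: iterable of (query_orf, subject_organism, evalue) triples.
    Hits with an expect value above max_evalue are discarded; for each
    query, the hit with the lowest expect value is kept as the best hit.
    """
    best = {}
    for query, organism, evalue in hits:
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (organism, evalue)
    counts = Counter(organism for organism, _ in best.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {organism: n / total for organism, n in counts.items()}
```

For example, `best_hit_shares([("orf1", "E. coli", 1e-50), ("orf1", "B. subtilis", 1e-20), ("orf2", "B. subtilis", 1e-8)])` assigns one best hit to each organism. Tightening `max_evalue` corresponds to the more stringent thresholds discussed in the text.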

Comparison of the P. aeruginosa and E. coli genomes indicates that the large genome of P. aeruginosa is the result of greater genetic complexity rather than differences in genome organization. Distributions of ORF sizes and inter-ORF spacings are both nearly identical in the two genomes (see http://www.pseudomonas.com), and the extent of evolutionarily recent duplications appears comparable. The longest repeats in the P. aeruginosa genome are the four rDNA loci and one duplicated gene cluster that spans a few thousand base pairs (PA1899–PA1905; PA4210–PA4216). At the level of amino-acid sequence conservation, residual stretches of locally conserved gene order between P. aeruginosa and E. coli are far more evident than are internal duplications in either genome. At the same BLASTP threshold employed for the comparisons between the P. aeruginosa ORFs and those of other bacterial genomes, we searched for clusters of five or more ORFs that are conserved between P. aeruginosa and E. coli, allowing single-ORF insertions or deletions within the clusters. Thirty-three distinct clusters were identified, which included 256 ORFs; seven of these clusters involve ten or more ORFs. This analysis showed only a few gene clusters duplicated within either the E. coli or the P. aeruginosa genome. Hence, with respect to local gene order, evidence of the common ancestry of segments of the P. aeruginosa and E. coli genome is far more abundant than are the vestiges of more recent duplication events of comparable size within either genome.
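One way to picture the search for conserved gene-order clusters is a greedy scan along one genome. The Python sketch below is an assumption-laden illustration rather than the method used here: it takes a precomputed mapping from ORF positions in one genome to the positions of their matches in the other, treats the genomes as linear, and does not require a consistent orientation within a cluster. All names are hypothetical.

```python
def conserved_clusters(ortholog, n_query, min_size=5, max_step=2):
    """Greedy scan for runs of ORFs whose matches are also near-neighbours.

    ortholog: dict mapping a query ORF position (0-based gene order) to the
              position of its match in the subject genome.
    n_query:  number of ORFs in the query genome.
    max_step: largest allowed distance between consecutive cluster members
              in either genome; 2 tolerates a single inserted or deleted ORF.
    Returns a list of clusters, each a list of query positions.
    """
    clusters = []
    current = []
    for pos in range(n_query):
        if pos not in ortholog:
            continue
        if current:
            prev = current[-1]
            q_step = pos - prev
            s_step = abs(ortholog[pos] - ortholog[prev])
            if q_step <= max_step and 1 <= s_step <= max_step:
                current.append(pos)
                continue
            # The run is broken: keep it only if it is long enough.
            if len(current) >= min_size:
                clusters.append(current)
        current = [pos]
    if len(current) >= min_size:
        clusters.append(current)
    return clusters
```

Applying the same scan to matches between ORFs within a single genome would give the corresponding count of internally duplicated clusters, which is the comparison the paragraph above draws.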

Evidence for increased functional diversity.

The apparent lack of recent gene duplication indicates that the size of the P. aeruginosa genome is due to greater gene and functional diversity. When we analysed the P. aeruginosa, E. coli, B. subtilis and M. tuberculosis genomes by BLASTP comparisons between all predicted ORFs within each organism, we found that the P. aeruginosa genome has significantly more distinct gene families (paralogous groups) than the other large bacterial genomes (see Supplementary Information). There are nearly 50% more paralogous groups in P. aeruginosa than predicted on the basis of a simple comparison of genome sizes.

If the larger genome of P. aeruginosa arose by recent gene duplication, we would have expected it to have a similar number of paralogous groups to the other large bacterial genomes, with a larger number of ORFs in each group. In fact, the number of ORFs in the paralogous groups in PAO1 is similar to the other genomes. These data indicate that selection for environmental versatility has favoured expansion of genetic capability through the development of numerous small paralogous gene families whose members encode distinct functions.
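The distinction drawn above, between many small families and a few large ones, depends on how ORFs are grouped. The sketch below assumes single-linkage grouping of ORFs joined by significant within-genome BLASTP matches, implemented with a union-find structure; the grouping rule and all names are assumptions, since the exact criteria are given only in the Supplementary Information.

```python
def paralogous_groups(orfs, similar_pairs):
    """Group ORFs by single linkage over pairwise similarity.

    orfs:          list of ORF identifiers.
    similar_pairs: iterable of (orf_a, orf_b) pairs judged similar.
    Returns a list of groups (lists of ORFs) with two or more members.
    """
    parent = {orf: orf for orf in orfs}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for orf in orfs:
        groups.setdefault(find(orf), []).append(orf)
    return [members for members in groups.values() if len(members) > 1]


def group_summary(groups):
    """Return (number of groups, mean ORFs per group)."""
    if not groups:
        return 0, 0.0
    return len(groups), sum(len(g) for g in groups) / len(groups)
```

Under the recent-duplication hypothesis one would expect `group_summary` to report a similar number of groups but a larger mean group size for the larger genome; the observation reported here is the opposite pattern.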

Annotation of P. aeruginosa genes

Prediction of gene function.

The predicted ORFs were examined individually for (1) identity with known genes of P. aeruginosa with sequences deposited in GenBank, (2) similarity with well-characterized genes from other bacteria, or (3) presence of known functional motifs (Table 1; see http://www.pseudomonas.com for complete list). In each case the literature was searched to ensure that the proteins encoded by the homologous genes were functionally characterized to avoid the perpetuation of poorly supported functional assignments. In addition, 61 researchers who were members of the P. aeruginosa research community or had experience in particular aspects of bacterial physiology were enlisted for the Pseudomonas Community Annotation Project (PseudoCAP) to provide expert assistance and confirmatory information for the identification of ORFs and assigned functions.

We were able to assign a functional class to 54.2% of ORFs (Table 2). As in other bacterial genomes, a large proportion of the genome (45.8% of ORFs) consists of genes for which no function could be determined or proposed (confidence level 4; see Table 1). Of these, nearly a third (769 ORFs) possess homology to genes of unknown function predicted in other bacterial genomes, and the remainder (32% of ORFs) does not have strong homology with any reported sequence.

Table 2 Functional classes of predicted genes

The 372 ORFs that are known P. aeruginosa genes with demonstrated functions (confidence level 1) are primarily genes encoding lipopolysaccharide biosynthetic enzymes, virulence factors, such as exoenzymes and the systems that secrete them, and proteins involved in motility and adhesion. ORFs with strong homology to genes in other organisms with demonstrated functions (confidence level 2; 1,059 ORFs) include those required for DNA replication, protein synthesis, cell-wall biosynthesis and intermediary metabolism. P. aeruginosa is able to grow on minimal medium, and as we expected, we identified most of the genes required for biosynthesis of amino acids, nucleic acids and cofactors.

The ORFs that provided the most new information about P. aeruginosa biology are those that could be assigned a probable function on the basis of similarity to established sequence motifs, but could not be assigned a definite name (confidence level 3; 1,590 ORFs). Most of these genes encode products that are in one of three functional classes: putative enzymes (405 genes), transcriptional regulators (341 genes) or transporters of small molecules (408 genes). In some cases genomic context provided additional information, allowing us to identify loci that appear to encode systems such as metabolic pathways and secretion systems, although the substrates for such systems could not be identified. These and other features of the P. aeruginosa genome that may shed light on its biology are discussed below. Additional details are available in the Supplementary Information and at http://www.pseudomonas.com.
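The ORF counts quoted in this section can be cross-checked with simple arithmetic. The short Python calculation below assumes that the four confidence levels partition the 5,570 predicted ORFs, so that the level 4 count is obtained by subtraction; that assumption is ours, and the variable names are illustrative.

```python
TOTAL_ORFS = 5570

# ORF counts for confidence levels 1-3, as quoted in the text.
LEVELS = {1: 372, 2: 1059, 3: 1590}

assigned = sum(LEVELS.values())        # ORFs with a known or probable function
unknown = TOTAL_ORFS - assigned        # confidence level 4, by subtraction
conserved_hypothetical = 769           # unknowns with homologues elsewhere
no_homology = unknown - conserved_hypothetical

# The three main functional classes within confidence level 3.
level3_main_classes = 405 + 341 + 408


def pct(part, whole=TOTAL_ORFS):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


print(pct(assigned))                          # share of ORFs with a function
print(pct(unknown))                           # share of ORFs without one
print(pct(conserved_hypothetical, unknown))   # share of unknowns with homologues
print(pct(no_homology))                       # share of ORFs with no strong homology
print(pct(level3_main_classes, LEVELS[3]))    # level 3 ORFs in the three classes
```

Under that assumption the figures agree with the text: 3,021 ORFs (54.2%) have an assigned function, 2,549 (45.8%) do not, 769 of those (30.2%, "nearly a third") have homologues of unknown function, and the remaining 1,780 amount to 32.0% of all ORFs.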

Regulation.

P. aeruginosa has the highest proportion of predicted regulatory genes observed in the sequenced bacterial genomes. Analysis using relevant Pfam 5.2 family models15 and HMMER 2.1.1 (http://hmmer.wustl.edu/) shows 468 genes containing motifs characteristic of transcriptional regulators or environmental sensors (see Supplementary Information). This analysis predicts that 8.4% of P. aeruginosa genes are involved in regulation, a far higher proportion than is found in other sequenced genomes. (Manual annotation of the genome identified 521 genes (9.4%) as encoding either transcriptional regulators or two-component regulatory system proteins (Table 2). Thus the parameters we employed gave somewhat conservative predictions.)

Similar computational analysis of regulatory motifs in 22 genomes indicates that as bacterial genome size increases, the proportion of the genome devoted to regulatory proteins increases as well (Fig. 2). This trend appears most prominent in prototrophic bacteria that can survive in diverse environments. For example, motifs characteristic of regulatory proteins are found in 5.8% of E. coli genes and 5.3% of B. subtilis genes, but only in 3.0% of genes in M. tuberculosis, a highly specialized pathogen with a comparable genome size. Helicobacter pylori, another highly specialized bacterial pathogen with a much smaller genome, possesses even less regulatory potential (1.1% of genes). When we compared P. aeruginosa transcriptional regulators with other bacterial systems, the most striking over-representations were in the LysR, AraC, ECF-σ and two-component regulator families. There is an extraordinary number of putative two-component regulatory system proteins, with 55 sensors, 89 response regulators and 14 sensor–response regulator hybrids, far more than found in the other genomes analysed. Such systems permit organisms to respond to changes in their environment, and are often associated with global regulatory systems as well as with regulation of virulence.

Figure 2: The percentage of genes with regulatory motifs increases with the size of the genome.

Each sensor or regulatory family model was extracted from the Pfam 5.2 database15 and analysed against a database containing the combined predicted ORFs of each of the 22 genome sequences listed below. For each genome, the total number of ORFs identified with a probability of less than 10⁻⁴ as containing any of the regulatory motifs was divided by the number of predicted ORFs in that genome to calculate the percentage of regulatory genes. The genomes analysed were M. genitalium (480 predicted ORFs), M. pneumoniae (677), R. prowazekii (834), B. burgdorferi (850), C. trachomatis (894), T. pallidum (1031), C. pneumoniae (1052), A. aeolicus (1522), H. pylori 26695 (1553), H. influenzae (1709), M. jannaschii (1715), P. abyssi (1765), T. maritima (1846), M. thermoautotrophicum (1869), N. meningitidis MC58 (2025), P. horikoshii (2064), A. fulgidus (2407), Synechocystis PCC6803 (3169), M. tuberculosis (3918), B. subtilis (4100), E. coli (4289), P. aeruginosa (5570).
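The percentage plotted for each genome is a simple ratio, shown here for the P. aeruginosa counts quoted in the Regulation section. The function name is illustrative, and counts for the other genomes are not reproduced because they are not given in the text.

```python
def percent_regulatory(n_regulatory_orfs, n_predicted_orfs):
    """Percentage of predicted ORFs carrying a regulatory motif (one decimal)."""
    return round(100 * n_regulatory_orfs / n_predicted_orfs, 1)


P_AERUGINOSA_ORFS = 5570

# 468 ORFs were identified by the Pfam/HMMER motif search,
# 521 by manual annotation (Table 2).
motif_based = percent_regulatory(468, P_AERUGINOSA_ORFS)
manual = percent_regulatory(521, P_AERUGINOSA_ORFS)

print(motif_based, manual)
```

The two values reproduce the 8.4% and 9.4% reported for the motif-based and manual estimates, respectively.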

Outer membrane proteins.

Outer membrane proteins (OMPs) are of particular interest in P. aeruginosa due to their cell-surface exposure and their involvement in transport of antibiotics, in export of extracellular virulence factors, and in anchoring the structures that mediate adhesion and motility. About 150 genes are predicted to encode OMPs, a disproportionately large number compared with other genomes. Three large paralogous families were identified: the OprD family of specific porins (19 genes), the TonB-family of gated porins, which includes proteins involved in iron-siderophore uptake (34 genes), and the OprM family of outer membrane proteins involved in efflux or secretion (18 genes). These large families of proteins were unexpected, as single members of these families (for example, OprD) had been well studied with no appreciation that these proteins were members of a large paralogous group. To date, the only other genome that is known to contain a large paralogous family of OMPs is H. pylori16. The identification of these families could have a significant impact on the focus of antimicrobial and vaccine research.

Import of nutrients.

Consistent with its environmental versatility, P. aeruginosa has nearly 300 cytoplasmic membrane transport systems, about two-thirds of which appear to be involved in the import of nutrients and other molecules (http://www-biology.ucsd.edu/ipaulsen/transport). The overall substrate specificities of the P. aeruginosa transporters are similar to those of E. coli and B. subtilis with certain significant exceptions (see Supplementary Information). P. aeruginosa has a large variety of transporters for mono-, di-, and tri-carboxylates, but it appears to be conspicuously deficient in sugar transporters. For example, it possesses four dicarboxylate permeases of the TRAP-T type (E. coli has only one), and has only two phosphotransferase system (PTS) sugar transporters—for fructose and N-acetylglucosamine (E. coli has more than twenty)17. Also, P. aeruginosa has no predicted sugar transporters of the major facilitator superfamily (MFS), although E. coli has more than twenty. The apparent lack of sugar transporters in P. aeruginosa correlates with the absence of an intact glycolytic pathway and with its aerobic, oxidative metabolism18.

β-Oxidative metabolism.

In contrast to its limited ability to grow on sugars, P. aeruginosa can use a wide variety of other carbon compounds, and its genome provides insight into the molecular basis of this metabolic versatility. In addition to known oxidative enzymes and pathways, we found a substantial number of other genes encoding putative enzymes characteristic of β-oxidation, such as acyl-CoA dehydrogenase (25 genes) and enoyl-CoA hydratase/isomerase (16 genes). In contrast, E. coli contains four genes for acyl-CoA dehydrogenase and seven for enoyl-CoA hydratase/isomerase. With the exception of M. tuberculosis, no other sequenced genome contains such large numbers of these enzymes. The β-oxidative genes are often clustered with other genes encoding proteins that may have related functions, such as probable acyl-CoA thiolases, short-chain dehydrogenases, flavin-containing monooxygenases, or other oxidoreductases. In several cases, these gene clusters also contained genes for MFS transport proteins and outer membrane porins of the OprD family (see Supplementary Information).

Intrinsic drug resistance and efflux systems.

P. aeruginosa is noted for its intrinsic resistance to many front-line antibiotics, due mainly to its low outer membrane permeability and to active efflux of antibiotics19. Four P. aeruginosa multidrug efflux systems have been reported, all of which are members of the resistance-nodulation-cell division (RND) family20,21. We used BLASTP analysis to identify potential export systems in the PAO1 genome, and probable multidrug efflux systems were identified by a phylogenetic analysis of each family17. The P. aeruginosa genome appears to contain a large number of undescribed drug efflux systems, predominantly of the RND and MFS families (Fig. 3). The number of predicted drug efflux systems from the MFS, small multi-drug resistance (SMR), ATP-binding cassette (ABC) and multidrug and toxic compound extrusion (MATE) families is similar to other organisms such as E. coli, B. subtilis and M. tuberculosis. However, P. aeruginosa contains many more predicted AcrB/Mex-type RND multidrug efflux systems (10 genes) than E. coli (4), B. subtilis (1) and M. tuberculosis (0). Each of the P. aeruginosa genes encoding a putative RND transport protein is adjacent to a gene for a probable membrane fusion protein; most RND loci also contain genes for outer membrane proteins of the OprM family (see Supplementary Information).

Figure 3: Comparison of the number of predicted drug efflux systems in P. aeruginosa, E. coli, B. subtilis and M. tuberculosis.

For the last three organisms, these numbers are based on predictions taken from http://www-biology.ucsd.edu/ipaulsen/transport/. Five types of multidrug efflux systems are analysed: resistance/nodulation/cell division family (RND; for example, E. coli AcrB), major facilitator superfamily (MFS; for example, B. subtilis Bmr), small multidrug resistance family (SMR; for example, E. coli EmrE), multidrug and toxic compound extrusion family (MATE; for example, Vibrio parahaemolyticus NorM)41 and ATP-binding cassette family (ABC; for example, Lactococcus lactis LmrA)17. Only family members that clearly clustered with known multidrug efflux systems were counted as probable multidrug efflux systems. For example, the number of RND multidrug efflux systems does not include members of this family that belong to the SecD/SecF protein excretion, the Czc metal efflux or the M. tuberculosis MmpL glycolipid efflux42 protein clusters, but only includes proteins belonging to the AcrB/Mex multidrug efflux protein cluster43.

Protein secretion.

P. aeruginosa secretes several virulence factors, including toxins, lipases and proteases. Four pathways of protein secretion have been described for Gram-negative bacteria22, and three of these were evident in P. aeruginosa. The prototypic type I system in P. aeruginosa, which directs secretion of alkaline protease (encoded by aprA), consists of the ABC transport protein AprD, the membrane fusion protein AprE and the OprM-family outer membrane protein AprF. The PAO1 genome appears to contain four additional type I systems. One of these clusters (PA3404–PA3408) is homologous to the haem acquisition system (Has) of Serratia marcescens. A sixth aprF homologue (PA4974) was not clustered with genes for other putative transport proteins. This gene was most similar in sequence to tolC, which encodes an E. coli outer membrane protein involved in haemolysin secretion.

A type II secretion system (general secretion pathway) is encoded by the xcp gene cluster (PA3095–PA3105) and the unlinked pilD/xcpA gene (PA4528)23. An unexpected finding was that many of the xcp genes have homologues elsewhere on the chromosome, including four additional homologues of xcpQ and xcpR. Type III secretion systems are found in many plant and human pathogens and are responsible for contact-dependent delivery of proteins into the cytoplasm of host cells. P. aeruginosa contains a single type III secretion system (PA1690–PA1725) which secretes several proteins including exoenzymes S, T and Y24.

Other potential surface molecules.

The genome of P. aeruginosa PAO1 contains two extremely long open reading frames, PA2462 (5,628 amino acids) and PA0041 (3,536 amino acids). Each of these ORFs has sequence similarity to proteins that are adhesins in other bacterial pathogens, the filamentous haemagglutinin (FhaB) of Bordetella pertussis, and the HMW1A/HMW2A adhesins of Haemophilus influenzae25,26. PA2462 is adjacent to an ORF (PA2461) with strikingly abnormal codon usage and a G+C content of only 38.5%. Further, in addition to the genes known to be involved in synthesis of lipopolysaccharides, we noted three other genetic loci that may be involved in the synthesis of extracellular polysaccharides (see Supplementary Information). For example, a seven-gene cluster (PA1385–PA1391) adjacent to the galE gene includes genes for four probable glycosyl transferases and an ABC transport protein similar to putative carbohydrate-exporting proteins encoded in O-antigen biosynthetic gene clusters in other organisms. This seven-gene cluster is in a region with lower G+C content than the surrounding genes.

Chemosensing and chemotaxis.

P. aeruginosa appears to have the most complex chemosensory systems of all the complete bacterial genomes, with four loci that encode probable chemotaxis signal-transduction pathways (see Supplementary Information). Of these, one (PA1456–PA1464) is similar in gene organization to the Salmonella typhimurium locus required for flagella-mediated swimming toward chemoattractants27. A second (PA0173–PA0180) more closely resembles the gene organization seen in Rhodobacter sphaeroides28. Each of the two remaining clusters has homology to the che genes from E. coli and to the frz genes from the non-flagellated gliding bacterium, Myxococcus xanthus29. PA0408–PA0417 is required for twitching motility30 and PA3702–PA3708 is as yet uncharacterized. P. aeruginosa undergoes chemotaxis toward a variety of sugars, amino acids, and inorganic phosphate, and away from thiocyanic and isothiocyanic esters31,32,33,34. We identified 26 ORFs encoding probable chemotaxis sensory transducer proteins to mediate these responses.

Discussion

We propose that the large genome size and genetic complexity of P. aeruginosa reflect evolutionary adaptations permitting it to thrive in diverse ecological niches. Analysis of the complete genome sequence of P. aeruginosa reveals many clues regarding the basis of this versatility. P. aeruginosa has broad capabilities to transport, metabolize and grow on organic substances, numerous iron-siderophore uptake systems, and the enhanced ability to export compounds (for example, enzymes and antibiotics) by a large number of protein secretion and RND efflux systems. P. aeruginosa potentially possesses four chemotaxis systems, at least one of which contributes to its ability to form biofilms35. Thus this organism can readily move to more favourable conditions or consolidate and ‘dig in’ for persistent colonization of a particular microenvironment. Consistent with its increased genetic complexity, P. aeruginosa has the greatest percentage of genes devoted to command and control systems (for example, environmental sensors and transcriptional regulators) observed in a bacterial genome. These regulatory genes presumably modulate the diverse genetic and biochemical capabilities of this bacterium in changing environmental conditions.

P. aeruginosa infections are particularly difficult to treat because of intrinsic resistance to antibiotics. It would appear that, in the course of evolving the functional diversity required to compete with other microorganisms in a variety of environments, it developed mechanisms for resisting naturally occurring antimicrobial compounds. The efflux systems we identified could contribute to this intrinsic resistance. The effects of antimicrobials could also be mitigated by modulating expression of drug targets, enzymatic modifiers, transport systems and compensatory pathways. Indeed, the unusually large regulatory capability in P. aeruginosa may provide greater latitude for adaptive drug resistance through gene regulation than exists in other bacteria with smaller genomes. Furthermore, given its capacity to metabolize a wide variety of organic substrates, it is also possible that P. aeruginosa possesses greater potential for enzymatic modification and degradative drug resistance mechanisms than was thought. Therefore, the metabolic diversity, transport capabilities and regulatory adaptability that enable P. aeruginosa to thrive and compete with other microorganisms probably all contribute to its high intrinsic resistance to antibiotics. Knowledge of the complete genome sequence and encoded processes provides a wealth of information for the discovery and exploitation of new antibiotic targets, and hope for the development of more effective strategies to treat the life-threatening opportunistic infections caused by P. aeruginosa in humans.

Methods

Sequencing and assembly

Strain PAO1, a wound isolate36, was chosen as a strain prototype for sequencing because it is the most widely used P. aeruginosa laboratory strain and because physical and genetic maps were available12,13. Strain PAO1 was obtained from B. Holloway's collection maintained in the laboratory of P. Phibbs, Univ. of Georgia. Our approach to the P. aeruginosa sequencing relied on standard data-collection and sequence-assembly methods. Most of the data comprised individual sequencing traces acquired from randomly picked M13 templates. The data were processed with the phred/phrap/consed package of base-calling, sequence-assembly, and finishing/editing software (http://bozeman.mbt.washington.edu)37,38. At frequent intervals throughout the project, de novo phrap assemblies were carried out using all available data. The main requirement for achieving practical computation times was adequate real memory: on the full data set, the assembly required 4 h on a workstation with 4 gigabytes of memory. We employed dye-primer and dye-terminator chemistry in a 51:49 ratio to acquire 94,847 usable shotgun-sequencing traces. The shotgun traces provided 6.9-fold coverage of the genome in ‘high-quality’ base calls (that is, those with phred scores > 20, which corresponds to an error rate < 1%). The final raw-data set included an additional 1,604 ‘finishing’ traces, most of which were obtained by priming with custom primers on M13 templates selected from the random-template collection. The purpose of most ‘finishing’ reads was to improve data quality in regions where the phred/phrap/consed software recognized that the consensus sequence had inadequate support. We also acquired 1,672 cosmid-end sequences from a collection of 836 cosmids that contained 40-kb inserts. The inserts in these cosmids covered 97% of the genome; hence, their end sequences provided a strong check on the validity of the final assembly.
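The sequencing statistics quoted above follow from a few standard relationships, sketched below in Python. The phred score is defined as Q = -10 log10(P), where P is the estimated probability that a base call is wrong; the derived quantities (total high-quality bases, high-quality bases per trace, cosmid clone coverage) are back-of-the-envelope estimates of ours, computed from the rounded figures in the text rather than reported values.

```python
import math


def phred_to_error(q):
    """Error probability corresponding to phred quality score q."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality score corresponding to error probability p."""
    return -10 * math.log10(p)


GENOME_BP = 6.3e6        # approximate genome size
COVERAGE = 6.9           # fold coverage in high-quality base calls
SHOTGUN_TRACES = 94_847  # usable shotgun-sequencing traces
COSMIDS = 836            # cosmids in the end-sequenced collection
COSMID_INSERT_BP = 40_000

# Total high-quality bases implied by the stated coverage.
high_quality_bases = GENOME_BP * COVERAGE

# Average number of high-quality bases contributed by each trace.
bases_per_trace = high_quality_bases / SHOTGUN_TRACES

# Two end reads per cosmid, and the nominal clone coverage of the inserts.
cosmid_end_reads = 2 * COSMIDS
cosmid_fold_coverage = COSMIDS * COSMID_INSERT_BP / GENOME_BP

print(phred_to_error(20))
print(round(bases_per_trace))
print(cosmid_end_reads, round(cosmid_fold_coverage, 1))
```

A phred score of 20 corresponds to an error probability of 0.01, matching the 1% threshold in the text. The estimates work out to roughly 458 high-quality bases per trace, 1,672 cosmid-end reads and about 5.3-fold nominal clone coverage, which is consistent with the inserts covering most, but not all, of the genome.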

Only 13 sites in the genome required specialized finishing procedures: four of these sites are the rDNA loci, which contain nearly exact copies of a 5.9-kb repeat, six are nearly identical copies of a 1.4-kb insertion sequence, and the other three also involve repeated sequences. The sequences at all 13 of these sites were obtained by sequencing PCR products that spanned the individual repeats. In addition to validating the assembly with cosmid-end sequences, we also monitored the base-pair accuracy of the final sequence in a variety of ways. One test involved full sequencing, by conventional methods, of two cosmids that contained widely spaced segments of the P. aeruginosa genome; these cosmids, which were selected at the beginning of the project, were brought to the highest achievable quality standard by expert ‘finishers’. In the final whole-genome assembly, which was entirely independent of the cosmid sequencing, we found no discrepancies with the 81,843 base pairs present in the two cosmids.

Gene predictions

Open reading frames were initially predicted by GeneMark.HMM (http://genemark.biology.gatech.edu/GeneMark/whmm.cgi)39. Additional ORFs with homology to known genes were identified by BLASTX analysis. Predicted ORFs were reviewed individually by gene annotators for start-codon assignment based on additional contextual information such as the proximity of ribosomal binding sequence motifs and predicted signal peptides.