Predicting relatedness of bacterial genomes using the chaperonin-60 universal target (cpn60 UT): Application to Thermoanaerobacter species

https://doi.org/10.1016/j.syapm.2010.11.019

Abstract

D.R. Zeigler determined that the sequence identity of bacterial genomes can be predicted accurately from the sequence identities of a corresponding set of genes that meet certain criteria [32]. His three-gene model for comparing bacterial genome pairs requires the determination of the sequence identities for recN, thdF, and rpoA. This involves generating approximately 4.2 kb of genomic DNA sequence from each organism to be compared, and normally also requires that oligonucleotide primers for amplification and sequencing be designed from the sequences of closely related organisms. As an alternative, we have developed an analogous mathematical model for predicting the sequence identity of whole genomes based on the sequence identity of the 549–567 base pair chaperonin-60 universal target (cpn60 UT). The cpn60 UT is accessible in nearly all bacterial genomes with a single set of universal primers, and it is short enough to be sequenced completely in one pair of overlapping dideoxy sequencing reads. Both models were applied to a set of Thermoanaerobacter isolates from a wood chip compost pile. The one-gene cpn60 UT-based model and the three-gene model based on recN, rpoA, and thdF both predicted that these isolates could be classified as Thermoanaerobacter thermohydrosulfuricus. Furthermore, the genomic prediction model using cpn60 UT gave results similar to those of whole-genome sequence alignments over a broad range of taxa, suggesting that this method may have general utility for screening isolates and predicting their taxonomic affiliations.

Introduction

The determination of the taxonomic identity of a bacterial isolate usually involves a combination of microscopy, phenotypic characterization, and DNA sequence analysis [8]. The designation of a novel species requires whole genome comparisons based on DNA–DNA hybridization data [26] or, increasingly, on comparisons of whole or partial genome sequence data [22]. Less onerous approaches based on the unambiguous and highly reproducible determination of selected gene sequences can be viewed as an alternative for assigning an isolate to a pre-existing taxon or for determining whether the experimentation needed to justify the designation of a novel species is warranted [8], [16], [20]. Which DNA sequence(s) to examine, however, is a matter of some choice. Although most isolates continue to be taxonomically classified based on the sequence of their 16S ribosomal RNA-encoding genes [4], [6], the limitations of this approach for discerning closely related bacteria are increasingly recognized. Stackebrandt et al. [26] suggested that “an informative level of phylogenetic data would be obtained from the determination of a minimum of five genes under stabilizing selection for encoded metabolic functions.” Protein-encoding genes are known to provide higher levels of taxonomic resolution than non-protein-encoding genes [24], and classifications based on gyrB [29], recA [30], rpoB [2], [18], recN [3], [31], cpn60 [9] and other genes have been proposed. Furthermore, Loren et al. [15] recently determined that the sequences of certain protein-encoding housekeeping genes, including cpn60, can be used to predict accurately the genomic G + C content of species within the genus Aeromonas. Several bacterial typing schemes based on sequence comparisons of multiple protein-encoding genes from each isolate (multilocus sequence typing, or MLST) have been reported, but again the choice of genes to include in an MLST scheme is variable [5].

Zeigler [32] has reported the most comprehensive analysis to date of genes that are likely to be useful for predicting relatedness at the whole genome level. For this analysis, a computational algorithm was developed for determining whole genome sequence identities for 44 bacterial genomes distributed across 16 genera that were available at the time of analysis. The whole genome sequence identities corresponded well with available DNA–DNA hybridization data [32], and these sequence identity values were used to develop correlations and corresponding prediction models for individual genes and discrete sets of genes. These models thus facilitated the prediction of whole genome sequence identity for pairs of isolates based on the determination of the sequence identity of a set of genes. A scan of the genomic information for genes that are universal, lack paralogs and are phylogenetically informative led to the identification of a set of 32 candidate genes. These genes were then evaluated for their abilities to predict whole genome relatedness, and examination of all these genes led to the development of mathematical models for predicting the sequence identity of pairs of genomes. The single-gene model that performed best (i.e. had the highest correlation between gene and genome sequence identities) was recN, while the best model overall involved three genes: recN, rpoA, and thdF [32]. The gene that had the lowest correlation of sequence identity to genome sequence identity was the 16S rRNA-encoding gene.
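The kind of gene-to-genome prediction model described above can be illustrated with an ordinary least-squares fit. The following is a minimal sketch only. It assumes a simple linear relationship between gene sequence identity and genome sequence identity, which may not match the functional form used in [32], and the names fit_line and predict_genome_identity are hypothetical. No published coefficients are reproduced; the caller supplies paired identity values.

```python
from math import sqrt


def fit_line(gene_si, genome_si):
    """Least-squares fit of genome identity on gene identity.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    ASSUMPTION: a straight-line model; the published model may differ.
    """
    n = len(gene_si)
    if n != len(genome_si) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(gene_si) / n
    mean_y = sum(genome_si) / n
    sxx = sum((x - mean_x) ** 2 for x in gene_si)
    syy = sum((y - mean_y) ** 2 for y in genome_si)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gene_si, genome_si))
    if sxx == 0 or syy == 0:
        raise ValueError("identity values must vary to fit a line")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


def predict_genome_identity(gene_identity, slope, intercept):
    """Apply the fitted line to one gene sequence identity value."""
    return slope * gene_identity + intercept
```

In this sketch, a gene whose fit yields a higher r would be the better single-gene predictor, which mirrors how the text ranks recN above the 16S rRNA-encoding gene.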

The chaperonin-60 gene (cpn60, or groEL in E. coli) is an approximately 1650 base pair (bp) gene whose product functions to chaperone protein folding in prokaryotes and eukaryotes [11]. It is present in virtually all Bacteria, and a fragment of this gene, the 549–567 bp (183–189 amino acid) cpn60 universal target (cpn60 UT), is accessible from any isolate or from a microbial community [10] with a set of universal amplification primers. Furthermore, the amplified sequence is highly distinct among organisms, and an extensive database of cpn60 UT sequence information is available [12]. Since cpn60 UT sequences are so distinct, short enough to be sequenced in a single reaction, and accessible using a universal set of amplification primers, we set out to determine how well cpn60 UT sequence identities correlate with the genome sequence identities reported by Zeigler [32]. We also compared predictions of genome sequence identity based on cpn60 UT with the results of whole-genome alignments across a broad range of taxa; such alignments are an emerging gold standard for species identity [22].
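The quantity compared throughout, pairwise sequence identity, can be sketched as follows for two sequences that have already been aligned. This is an illustration under stated assumptions rather than the procedure used in this work: the function names are hypothetical, and skipping gapped columns is only one of several possible conventions. A second helper shows the arithmetic linking the stated fragment lengths, since 549 bp and 567 bp correspond to 183 and 189 codons.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap.

    ASSUMPTION: inputs are pre-aligned and of equal length; gapped columns
    are ignored rather than counted as mismatches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip columns containing a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def codon_count(length_bp):
    """Number of codons encoded by an in-frame fragment of the given length."""
    if length_bp % 3 != 0:
        raise ValueError("length is not a whole number of codons")
    return length_bp // 3
```

Because the cpn60 UT is a single short fragment, one such comparison per isolate pair would be enough to feed a one-gene prediction model.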

Bacterial strains

Thermoanaerobacter brockii subsp. brockii DSM 1457T, Thermoanaerobacter pseudethanolicus DSM 2355T, and Thermoanaerobacter thermohydrosulfuricus DSM 567T were purchased from the Deutsche Sammlung von Mikroorganismen und Zellkulturen (DSMZ), Braunschweig, Germany. Isolates of Thermoanaerobacter spp. were collected from a decaying wood compost sample in Arundel, PQ, Canada. Briefly, working inside a Forma Scientific anaerobic chamber, ∼10 g of woodchips were transferred to 500 mL sterile ATCC 1191

Results and discussion

The genomic prediction model described by Zeigler [32] is useful and elegant, and it provides a strong mathematical rationale for the selection of target genes for determining the taxonomic identity of an unknown isolate. The three-gene model or the one-gene recN-based model has been successfully applied to a variety of isolates [14], [31]. However, it should be noted that, while the criteria upon which genes were selected included those that were long enough to be phylogenetically informative,

Acknowledgements

We thank Florian Labat for technical assistance and Sean Hemmingsen for critical comments on this manuscript. This work was funded by a Natural Sciences and Engineering Research Council (NSERC) Strategic Grant (STPGP 365076), Genome Canada, and by the Cellulosic Biofuels Network (Agriculture and Agri-Food Canada).

References (32)

  • D. Gevers et al. Re-evaluating prokaryotic species. Nat. Rev. Microbiol. (2005)
  • O.O. Glazunova et al. Partial sequence comparison of the rpoB, sodA, groEL and gyrB genes within the genus Streptococcus. Int. J. Syst. Evol. Microbiol. (2009)
  • S.H. Harvey et al. Structural maintenance of chromosomes (SMC) proteins, a family of conserved ATPases. Genome Biol. (2002)
  • S.M. Hemmingsen et al. Homologous plant and bacterial proteins chaperone oligomeric protein assembly. Nature (1988)
  • J.E. Hill et al. cpnDB: a chaperonin sequence database. Genome Res. (2004)
  • J.E. Hill et al. Improved template representation in cpn60 polymerase chain reaction (PCR) product libraries generated from complex templates by application of a specific mixture of PCR primers. Environ. Microbiol. (2005)

Nucleotide sequences determined in this work are deposited in the GenBank database under the following accession numbers: for cpn60: HM623910 and HM623896–HM623907; for thdF: HQ153761–HQ153772; for recN: HQ153773–HQ153784; for rpoA: HQ153785–HQ153796.
