Biochimie

Volume 90, Issue 4, April 2008, Pages 609-614

Research paper
Automatic identification of large collections of protein-coding or rRNA sequences

https://doi.org/10.1016/j.biochi.2007.08.006

Abstract

The number of available genomic sequences is growing very fast owing to the development of massive sequencing techniques. These sequences need to be identified, and identification contributes to the assessment of evolutionary relationships among genes and species. Automated bioinformatics tools are thus necessary to carry out identification accurately and quickly. We developed HoSeqI (Homologous Sequence Identification), a software environment that performs this kind of automated sequence identification using homologous gene family databases. HoSeqI is accessible through a Web interface (http://pbil.univ-lyon1.fr/software/HoSeqI/) that makes it possible to identify one or several sequences and to visualize the resulting alignments and phylogenetic trees. We also implemented another application, MultiHoSeqI, to quickly add a large set of sequences to a family database in order to identify them, to update the database, or to help automatic genome annotation. More recently, we developed a third application, ChiSeqI (Chimeric Sequence Identification), to automate the identification of bacterial 16S ribosomal RNA sequences and the detection of chimeric sequences.

Introduction

Identification is used in many fields, such as microbiology, medicine and environmental science. Sequence identification consists of assigning an unknown taxonomic unit to a taxonomic group of a pre-established classification. Thus, to identify a new taxon or a new sequence, it is necessary to find its nearest known taxon. In the medical field, identification methods are used to detect and recognize the micro-organisms involved in pathologies, which helps in choosing the most suitable treatment. Identification can also be used in the agri-food sector as a tool for food traceability. In other contexts, such as the identification of species or taxa from molecular markers of environmental organisms, the comparison of a new sequence with a database, or the update of a sequence database, a new sequence must be assigned to a collection. Because the number of available biological sequences is increasing considerably with the development of massive sequencing techniques, it is necessary to rapidly classify these sequences into existing databases.

The approach used for identification depends on the data analyzed. Several identification tools exist, and most of them are specific to a domain or to a type of sequence, and thus to the sequence databases for which they were developed. For instance, some applications perform bacterial identification, such as BIBI (Bioinformatic Bacterial Identification) [1], PhyID/CD [2] or MicroSeq (Microbial identification System); others are specialized for the medical domain, such as RIDOM (Ribosomal Differentiation Of Medical Organisms) [3], or for the identification of ribosomal RNA sequences, such as the RDP classifier (Ribosomal Database Project) [4], or TaxI [5], which is based on DNA barcodes.

We are interested in the homologous gene family databases HOVERGEN and HOGENOM [6] developed in our group. In these databases, homologous sequences are clustered into families, i.e., sequences of the same family share a common ancestor. Sequence alignments and phylogenetic trees for each family are also stored in these databases. These databases can therefore be used for different purposes, including phylogenetic analyses, and they allow the study of evolutionary relationships between sequences. Building these family databases requires several complex automated procedures (similarity search, gene clustering, multiple alignment and tree computations). With the very fast growth of biological data, gene family database updates are time-consuming and tedious. Moreover, the addition of a single sequence to a given family from these databases can have many repercussions on the topology of the associated phylogenetic tree; these changes may be located near the introduced sequence, but they may also affect deep nodes. In such cases, the phylogenetic information brought by the whole family should be taken into account. Also, as HOVERGEN and HOGENOM contain large families, with several thousand sequences, powerful algorithms are required to manage large numbers of sequences. Available identification tools, such as those presented above, were developed to treat specific data and cannot be used effectively with large family databases. Thus, it is necessary to develop methods and bioinformatics tools (i) to carry out identification processes in a precise and rapid way, and (ii) to quickly add sequences to these databases without rebuilding them entirely.
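To make the idea of assigning a new sequence to an existing family without rebuilding the database more concrete, the following is a minimal sketch and not the method used by the tools described here. It replaces a real similarity search with a toy k-mer Jaccard score, and the function names (`kmers`, `jaccard`, `assign_family`), the family names, the example sequences and the `threshold` value are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers of a sequence (empty if too short)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def assign_family(
    query: str,
    families: Dict[str, List[str]],
    k: int = 3,
    threshold: float = 0.3,  # hypothetical cut-off, not taken from the paper
) -> Tuple[Optional[str], float]:
    """Assign the query to the family holding its most similar member.

    Toy stand-in for a similarity search: each family is scored by its
    best-matching member, and the query stays unassigned (None) when the
    best score is below the threshold.
    """
    q = kmers(query, k)
    best_family: Optional[str] = None
    best_score = 0.0
    for name, members in families.items():
        score = max((jaccard(q, kmers(m, k)) for m in members), default=0.0)
        if score > best_score:
            best_family, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_family, best_score


# Hypothetical miniature "family database" with made-up protein fragments.
FAMILIES = {
    "FAM_A": ["MKVLAAGIVGLLLA", "MKVLAAGIVALLLA"],
    "FAM_B": ["GGSPWWTRHQ"],
}

if __name__ == "__main__":
    print(assign_family("MKVLAAGIVGLLLA", FAMILIES))
    print(assign_family("YYYYYYYY", FAMILIES))
```

In this sketch only the query is compared with the stored families; the families themselves are left untouched, which mirrors the goal of adding sequences without a full update. A realistic pipeline would follow the assignment with alignment and tree computation for the selected family.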

Section snippets

Two applications adapted to homologous gene family databases: HoSeqI and MultiHoSeqI

We have developed an application, HoSeqI (Homologous Sequence Identification), and a second one derived from it, MultiHoSeqI. HoSeqI [7] is a Web application (http://pbil.univ-lyon1.fr/software/HoSeqI/) that automatically identifies sequences in large gene family databases. Identifying an unknown sequence in these databases consists of (i) finding the homologous gene family to which this sequence belongs, using similarity search, (ii) aligning the analyzed sequence

Use of MultiHoSeqI with sequences of bacterial genus Frankia

MultiHoSeqI has been used to add genes from several collections of protein sequences to the databases developed by the PBIL (Pôle BioInformatique Lyonnais): putative protein sequences from metagenomes and from completely sequenced bacterial genomes. In collaboration with Philippe Normand (Laboratory of Soil Microbial Ecology, University of Lyon), Vincent Daubin and Simon Penel (Laboratory of Biometry and Evolutionary Biology, University of Lyon), this application was used to add predicted

An application adapted to 16S ribosomal RNA sequence databases: ChiSeqI

We are also interested in 16S ribosomal RNA databases, such as the American database RDP [11] or the European ribosomal RNA database [12]. These databases contain 16S ribosomal RNA (rRNA) sequences, which are commonly used for bacterial identification because these molecules are ubiquitous, abundant in cells and have a conserved structure. When sequences come from PCR amplification, chimeras, i.e. artifactual sequences produced by the experimental protocol and composed of several

Conclusion

We have presented here three applications allowing rapid and automatic identification of genomic sequences. First, HoSeqI and MultiHoSeqI are adapted to homologous sequence databases. Via a Web interface, HoSeqI determines the homologous gene families to which a series of query sequences belong and lets the user visualize the alignments and phylogenetic trees of these families, including the analyzed sequences. HoSeqI thus contributes to the study of the evolutionary background of new sequences.

References (27)

  • G. Devulder et al.

    BIBI, a bioinformatic bacterial identification tool

    J. Clin. Microbiol.

    (2003)
  • J.P. Flandrois et al.

    Génération et visualisation de la phylogénie des Bacteria pour l'étude des incohérences taxinomie-phylogénie

  • D. Harmsen et al.

    RIDOM: comprehensive and public sequence database for identification of Mycobacterium species

    BMC Infect. Dis.

    (2003)
  • J.R. Cole et al.

    The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis

    Nucleic Acids Res.

    (2005)
  • D. Steinke et al.

    TaxI: a software tool for DNA barcoding using distance methods

    Philos. Trans. R. Soc. Lond. B Biol. Sci.

    (2005)
  • L. Duret et al.

    HOVERGEN: database and software for comparative analysis of homologous vertebrate genes

  • A.M. Arigon et al.

    HoSeqI: automated homologous sequence identification in gene family databases

    Bioinformatics

    (2006)
  • G. Perrière et al.

    HOBACGEN: database system for comparative genomics in bacteria

    Genome Res.

    (2000)
  • J.F. Dufayard et al.

    Tree pattern matching in phylogenetic trees: automatic search for orthologs or paralogs in homologous gene sequence databases

    Bioinformatics

    (2005)
  • J. Castresana

    Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis

    Mol. Biol. Evol.

    (2000)
  • J.R. Cole et al.

    The ribosomal database project (RDP-II): introducing myRDP space and quality controlled public data

    Nucleic Acids Res.

    (2007)
  • J. Wuyts et al.

    The European ribosomal RNA database

    Nucleic Acids Res.

    (2004)
  • J.F. Robison-Cox et al.

    Evaluation of nearest-neighbor methods for detection of chimeric small-subunit rRNA sequences

    Appl. Environ. Microbiol.

    (1995)