Research paper
Automatic identification of large collections of protein-coding or rRNA sequences
Introduction
Identification is used in many fields, such as microbiology, medicine and environmental science. Sequence identification consists in assigning an unknown taxonomic unit to a taxonomic group of a pre-established classification. Thus, to identify a new taxon or a new sequence, it is necessary to find its nearest known taxon. In the medical field, identification methods are used to detect and recognize the micro-organisms involved in pathologies, which helps in choosing the most suitable treatment. Identification can also be used in the agri-food sector as a tool for food traceability. A new sequence must also be assigned to a collection in other contexts, such as the identification of species or taxa from molecular markers of environmental organisms, the comparison of a new sequence with a database, or the update of a sequence database. As the number of available biological sequences is increasing considerably with the development of massive sequencing techniques, these sequences must be classified rapidly into existing databases.
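The idea of finding the nearest known taxon can be illustrated with a minimal sketch. It assumes a toy alignment-free distance based on shared k-mers, which is only a stand-in for the similarity searches used by real identification tools and is not the method of any application cited here. The function names, the taxon labels and the sequences are all hypothetical.

```python
from collections import Counter


def kmer_profile(seq, k=3):
    """Count the overlapping k-mers of a sequence (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """Toy distance: 1 - (shared k-mers / k-mers of the smaller profile)."""
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    shared = sum((pa & pb).values())  # multiset intersection
    smallest = min(sum(pa.values()), sum(pb.values()))
    # Sequences shorter than k have no k-mers: treat them as maximally distant.
    return 1.0 - shared / smallest if smallest else 1.0


def identify(query, references, k=3):
    """Return (taxon, distance) for the reference nearest to the query."""
    return min(
        ((taxon, kmer_distance(query, seq, k)) for taxon, seq in references.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical reference collection: made-up sequences and labels.
REFERENCES = {
    "taxon_A": "ATGGCGTACGTTAGCCTAGGCTAACGT",
    "taxon_B": "ATGCCCTTTGGGAAACCCTTTGGGAAA",
    "taxon_C": "TTGACCGGTTAACCGGTTAACCGGTTA",
}
```

Calling `identify(query, REFERENCES)` returns the label of the closest reference together with its distance. A practical tool would also need a distance threshold above which the query is reported as unidentified, and would typically confirm the assignment with an alignment or a phylogenetic placement.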
The approach used for identification differs according to the data analyzed. Several identification tools exist, and most of them are specific to a domain or to a type of data, and thus to the sequence databases for which they were developed. For instance, some applications perform bacterial identification, such as BIBI (Bioinformatic Bacterial Identification) [1], PhyID/CD [2] or MicroSeq (Microbial identification System); others are specialized for the medical domain, such as RIDOM (Ribosomal Differentiation Of Medical Organisms) [3], for the identification of ribosomal RNA sequences, such as the RDP classifier (Ribosomal Database Project) [4], or for DNA barcoding, such as TaxI [5].
We are interested in the homologous gene family databases HOVERGEN and HOGENOM [6] developed in our group. In these databases, homologous sequences are clustered into families, i.e., sequences of the same family share a common ancestor. Sequence alignments and phylogenetic trees for each family are also stored in these databases. These databases can thus be used for different purposes, including phylogenetic analyses, and they allow the study of evolutionary relationships between sequences. Building these family databases requires several complex automated procedures (similarity search, gene clustering, multiple alignment and tree computation). With the very fast growth of biological data, gene family database updates are time-consuming and tedious. Moreover, adding a single sequence to a family can have many repercussions on the topology of the associated phylogenetic tree; these changes may be located near the introduced sequence, but they may also affect deep nodes. In such cases, the phylogenetic information brought by the whole family should be taken into account. In addition, as HOVERGEN and HOGENOM contain large families, with several thousand sequences, powerful algorithms are required to handle large numbers of sequences. Available identification tools, such as those presented above, were developed to treat specific data and cannot be used effectively with large family databases. It is therefore necessary to develop methods and bioinformatics tools (i) to carry out identification precisely and rapidly, and (ii) to quickly add sequences to these databases without rebuilding them entirely.
Section snippets
Two applications adapted to homologous gene family databases: HoSeqI and MultiHoSeqI
We have developed an application, HoSeqI (Homologous Sequence Identification), and a second one derived from it, MultiHoSeqI. HoSeqI [7] is a Web application (http://pbil.univ-lyon1.fr/software/HoSeqI/) that automatically identifies sequences in large gene family databases. The identification of an unknown sequence in these databases consists in (i) finding the homologous gene family to which this sequence belongs, using similarity search, (ii) aligning the analyzed sequence
Use of MultiHoSeqI with sequences of bacterial genus Frankia
MultiHoSeqI has been used to add genes from several collections of protein sequences to the databases developed by the PBIL (Pôle BioInformatique Lyonnais): putative protein sequences from metagenomes and from completely sequenced bacterial genomes. In collaboration with Philippe Normand (Laboratory of Soil Microbial Ecology, University of Lyon), Vincent Daubin and Simon Penel (Laboratory of Biometry and Evolutionary Biology, University of Lyon), this application was used to add predicted
An application adapted to 16S ribosomal RNA sequence databases: ChiSeqI
We are also interested in 16S ribosomal RNA databases, such as the American database RDP [11] or the European Ribosomal RNA Database [12]. These databases contain 16S ribosomal RNA (rRNA) sequences, which are commonly used for bacterial identification because these molecules are ubiquitous, abundant in cells and have a conserved structure. When sequences come from PCR amplification, chimeras, i.e. artifactual sequences produced by the experimental protocol and composed of several
Conclusion
We have presented here three applications allowing rapid and automatic identification of genomic sequences. Firstly, HoSeqI and MultiHoSeqI are adapted to homologous sequence databases. Via a Web interface, HoSeqI determines the homologous gene families to which a series of query sequences belong and offers visualization of the alignments and phylogenetic trees of these families, including the analyzed sequences. HoSeqI thus contributes to the study of the evolutionary background of new sequences.
References (27)
- et al. BIBI, a bioinformatic bacterial identification tool. J. Clin. Microbiol. (2003)
- et al. Génération et visualisation de la phylogénie des Bacteria pour l'étude des incohérences taxinomie-phylogénie
- et al. RIDOM: comprehensive and public sequence database for identification of Mycobacterium species. BMC Infect. Dis. (2003)
- et al. The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis. Nucleic Acids Res. (2005)
- et al. TaxI: a software tool for DNA barcoding using distance methods. Philos. Trans. R. Soc. Lond. B Biol. Sci. (2005)
- et al. HOVERGEN: database and software for comparative analysis of homologous vertebrate genes
- et al. HoSeqI: automated homologous sequence identification in gene family databases. Bioinformatics (2006)
- et al. HOBACGEN: database system for comparative genomics in bacteria. Genome Res. (2000)
- et al. Tree pattern matching in phylogenetic trees: automatic search for orthologs or paralogs in homologous gene sequence databases. Bioinformatics (2005)
- Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. (2000)