doi:10.1006/jmbi.2001.5102
Copyright © 2001 Academic Press. All rights reserved.
Regular article
Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles¹
Daniel Gautheret 1 and André Lambert 2
1 Centre d’Immunologie de Marseille Luminy, CNRS UMR 6102/INSERM U 136 Luminy Case 906 13288, Marseille Cedex 09, France
2 Centre de Physique Théorique CNRS UPR 7061, Luminy Case 907, 13288, Marseille Cedex 9, France
Received 17 May 2001; revised 17 September 2001; accepted 18 September 2001. Available online 26 February 2002.
Abstract
We present a new approach to the problem of defining RNA signatures and finding their occurrences in sequence databases. The proposed method is based on “secondary structure profiles”. An RNA sequence alignment with secondary structure information is used as input. Two types of weight matrices (profiles) are constructed from this alignment: single strands are represented by a classical lod-score profile, while helical regions are represented by an extended “helical profile” comprising 16 lod-scores per position, one for each of the 16 possible base-pairs. Database searches are then conducted using a simultaneous search for helical profiles and dynamic programming alignment of single-strand profiles. The algorithm has been implemented in a new program, ERPIN, that performs both profile construction and database search. Applications are presented for several RNA motifs. The automated use of sequence information in both single-stranded and helical regions yields better sensitivity/specificity ratios than descriptor-based programs. Furthermore, since the translation of alignments into profiles is straightforward with ERPIN, iterative searches can easily be conducted to enrich collections of homologous RNAs.
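As an illustration of the "helical profile" idea described in the abstract, the following minimal Python sketch builds a table of 16 lod-scores per helix position from a toy alignment and scores a candidate helix against it. It is not ERPIN's implementation: the function names `helix_profile` and `score_helix`, the uniform background frequency of 1/16, the pseudocount of 1, the base-2 logarithm, and the assumption that helices contain no gaps are all choices made here for illustration only.

```python
import math
from itertools import product

BASES = "ACGU"
# The 16 possible base pairs: AA, AC, ..., UU
PAIRS = ["".join(p) for p in product(BASES, repeat=2)]


def helix_profile(strand5, strand3, pseudocount=1.0):
    """Build a helical profile: one dict of 16 lod-scores per helix position.

    strand5 and strand3 are lists of equal-length, gap-free strings, both
    written 5'->3'. Position i of the 5' strand pairs with position n-1-i
    of the 3' strand. Background is assumed uniform (1/16 per pair).
    """
    n = len(strand5[0])
    background = 1.0 / len(PAIRS)
    profile = []
    for i in range(n):
        # Start every pair at the pseudocount so no frequency is zero.
        counts = {p: pseudocount for p in PAIRS}
        for s5, s3 in zip(strand5, strand3):
            counts[s5[i] + s3[n - 1 - i]] += 1
        total = sum(counts.values())
        profile.append(
            {p: math.log2(counts[p] / total / background) for p in PAIRS}
        )
    return profile


def score_helix(profile, cand5, cand3):
    """Sum the lod-scores of the pairs formed by two candidate strands."""
    n = len(profile)
    return sum(profile[i][cand5[i] + cand3[n - 1 - i]] for i in range(n))


# Toy training set: three aligned 3-bp helices (hypothetical data).
strand5 = ["GGC", "GAC", "GGC"]
strand3 = ["GCC", "GUC", "GUC"]
profile = helix_profile(strand5, strand3)

good = score_helix(profile, "GGC", "GCC")  # pairs seen in the training set
bad = score_helix(profile, "AAA", "AAA")   # pairs never seen in training
print(good > bad)
```

Because every one of the 16 pairs receives its own score, a helical profile of this kind can reward covarying pairs such as G-U or A-U at a given position without requiring strict Watson-Crick complementarity.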
Author Keywords: RNA motifs; sequence alignment; secondary structure; motif search; profiles
Figure 1. Main steps of the Secondary Structure Profile matching algorithm. (a) The training set consists of aligned RNA sequences and their secondary structure annotation (here a hairpin structure). (b) From each helix and single-strand in the training set, a profile is constructed. For a helix of size n, the profile has 16 rows and n columns. For a single-strand of size n, the profile has five rows and n columns. (c) Search procedure for the hairpin structure (see the text). The square matrix is the Dynamic Programming matrix constructed for the sequence-to-profile alignment. An example of traceback is shown by red arrows.
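The single-strand profile and the sequence-to-profile dynamic programming step mentioned in the Figure 1 caption can be sketched as follows. This is a hedged illustration, not the recurrence used by ERPIN: it assumes the five profile rows correspond to A, C, G, U and a gap symbol, uses a uniform background of 1/5, a pseudocount of 1, and a fixed penalty (`insert_penalty`, a hypothetical parameter) for sequence residues that match no profile column. The names `strand_profile` and `align_score` are invented for this example.

```python
import math

ALPHABET = "ACGU-"  # assumed five rows: four bases plus a gap symbol


def strand_profile(aligned, pseudocount=1.0):
    """Build a single-strand profile: 5 lod-scores per alignment column."""
    n = len(aligned[0])
    background = 1.0 / len(ALPHABET)  # uniform background (assumption)
    profile = []
    for i in range(n):
        counts = {c: pseudocount for c in ALPHABET}
        for seq in aligned:
            counts[seq[i]] += 1
        total = sum(counts.values())
        profile.append(
            {c: math.log2(counts[c] / total / background) for c in ALPHABET}
        )
    return profile


def align_score(profile, seq, insert_penalty=-4.0):
    """Best global alignment score of seq against the profile.

    dp[i][j] is the best score for aligning the first i profile columns
    with the first j residues of seq.
    """
    n, m = len(profile), len(seq)
    dp = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + insert_penalty
    for i in range(1, n + 1):
        col = profile[i - 1]
        dp[i][0] = dp[i - 1][0] + col["-"]
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + col[seq[j - 1]],  # residue matches column
                dp[i - 1][j] + col["-"],             # column aligned to a gap
                dp[i][j - 1] + insert_penalty,       # residue outside profile
            )
    return dp[n][m]


# Toy aligned loop sequences (hypothetical data), one containing a gap.
loops = ["GAAA", "GCAA", "G-AA", "GAAA"]
profile = strand_profile(loops)

print(align_score(profile, "GAAA") > align_score(profile, "CUUU"))
```

Storing the gap as a fifth row lets the alignment skip a profile column at a cost that depends on how often that column is gapped in the training set; a traceback through `dp`, like the one drawn in the figure, would recover the alignment itself.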
Figure 2. Basic structural elements defined in the ERPIN algorithm, for which all optimal scoring solutions are found. (a) Hairpin; (b) single-strand; (c) all possible combinations of two helices.
Figure 3. Study of the 301–337 region of 23 S rRNA. (a) Base conservation in this region among bacterial sequences, as provided by Robin Gutell (http://www.rna.icmb.utexas.edu/); capital letters: >95 % conserved; lowercase: 90–95 % conserved; filled circle: 80–90 % conserved; open circle: <80 % conserved; (b) solutions with scores higher than 10 in the E. coli genome for the 301–337 motif. Scores are plotted in the order of appearance, in the positive strand and reverse-complement.
Figure 4. Consensus secondary structure and sequence conservation in the SECIS element, as used by Lescure et al. [18].
Figure 5. Results of an iterative search for IRE elements in the UTR database (see explanation in the text).
Table 1. tRNA motif searches

Table 2. SECIS motif searches

Corresponding author
1 Edited by J. Doudna