Regular article
Dynalign: an algorithm for finding the secondary structure common to two RNA sequences¹

https://doi.org/10.1006/jmbi.2001.5351

Abstract

With the rapid increase in the size of the genome sequence database, computational analysis of RNA will become increasingly important in revealing structure-function relationships and potential drug targets. RNA secondary structure prediction for a single sequence is 73 % accurate on average for a large database of known secondary structures. This level of accuracy provides a good starting point for determining a secondary structure either by comparative sequence analysis or by the interpretation of experimental studies. Dynalign is a new computer algorithm that improves the accuracy of structure prediction by combining free energy minimization and comparative sequence analysis to find a low free energy structure common to two sequences without requiring any sequence identity. It uses a dynamic programming construct suggested by Sankoff. Dynalign, however, restricts the maximum distance, M, allowed between aligned nucleotides in the two sequences. This makes the calculation tractable because the complexity is simplified to O(M³N³), where N is the length of the shorter sequence.
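The effect of the distance restriction can be illustrated by counting how many index tuples a four-dimensional array would have to cover. The sketch below is only an illustration, not the Dynalign implementation: the function name `count_states` is hypothetical, and it assumes the restriction means |i - k| <= M and |j - l| <= M for both aligned fragment ends. It counts array entries only; it does not reproduce the full O(M³N³) time bound.

```python
def count_states(n1, n2, m=None):
    """Count tuples (i, j, k, l) with 1 <= i <= j <= n1 and 1 <= k <= l <= n2.

    If m is given, keep only tuples where the aligned ends are close:
    |i - k| <= m and |j - l| <= m (an assumed reading of the restriction).
    """
    total = 0
    for i in range(1, n1 + 1):
        for k in range(1, n2 + 1):
            if m is not None and abs(i - k) > m:
                continue  # 5' ends too far apart to be aligned
            for j in range(i, n1 + 1):
                for l in range(k, n2 + 1):
                    if m is not None and abs(j - l) > m:
                        continue  # 3' ends too far apart to be aligned
                    total += 1
    return total


if __name__ == "__main__":
    # Compare the unrestricted and restricted counts for two 20-mers.
    print(count_states(20, 20), count_states(20, 20, m=2))
```

For a fixed M, the restricted count grows roughly with the square of the sequence length rather than its fourth power, which is the intuition behind the tractability claim.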

The accuracy of Dynalign was tested with sets of 13 tRNAs, seven 5 S rRNAs, and two R2 3′ UTR sequences. On average, Dynalign predicted 86.1 % of known base-pairs in the tRNAs, as compared to 59.7 % for free energy minimization alone. For the 5 S rRNAs, the average accuracy improves from 47.8 % to 86.4 %. The secondary structure of the R2 3′ UTR from Drosophila takahashii is poorly predicted by standard free energy minimization. With Dynalign, however, the structure predicted in tandem with the sequence from Drosophila melanogaster nearly matches the structure determined by comparative sequence analysis.
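The accuracy figures above are percentages of known base-pairs that appear in the predicted structure. A minimal sketch of such a score follows, assuming exact matching of pairs; the function name `percent_known_pairs_predicted` and the example pair lists are hypothetical, and the scoring rules used in the paper may differ (for example, by tolerating small shifts in pair position).

```python
def percent_known_pairs_predicted(known, predicted):
    """Return the percentage of known base-pairs found in the prediction.

    Pairs are (i, j) index tuples; (i, j) and (j, i) are treated as the same pair.
    Exact matching is an assumption made for this sketch.
    """
    known_set = {tuple(sorted(pair)) for pair in known}
    predicted_set = {tuple(sorted(pair)) for pair in predicted}
    if not known_set:
        raise ValueError("known structure contains no base-pairs")
    return 100.0 * len(known_set & predicted_set) / len(known_set)


if __name__ == "__main__":
    # Made-up example: three of four known pairs are recovered.
    known = [(1, 10), (2, 9), (3, 8), (4, 7)]
    predicted = [(1, 10), (9, 2), (3, 8), (5, 6)]
    print(percent_known_pairs_predicted(known, predicted))
```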

Introduction

The rapidly expanding databases of genome sequences provide a foundation for rapidly generating new databases of RNA secondary structures. These secondary structures are important for understanding structure-function relationships and choosing drug targets. Comparative sequence analysis is the gold standard for determination of RNA secondary structure in the absence of a structure solved by X-ray crystallography [1]. Structures of large RNAs solved by X-ray crystallography have largely verified the base-pairs predicted by comparative sequence analysis [2, 3, 4, 5, 6, 7, 8]. While only a small number of RNA structures have been determined by crystallography, many classes of RNAs have secondary structures determined by comparative sequence analysis. These include the small subunit rRNA [9], large subunit rRNA [10], 5 S rRNA [11], group I intron [12], group II intron [13], RNAase P RNA [14], SRP RNA [15], tRNA [16], telomerase RNA [17, 18], and tmRNA [19].

Comparative sequence analysis requires an alignment of a large number of sequences with identical function. When only one sequence is available, the secondary structure can be predicted on the basis of free energy minimization with an accuracy of roughly 73 % on average for sequences of less than 700 nucleotides [20, 21]. Several algorithms are available for free energy minimization of RNA secondary structure [20, 21, 22, 23, 24, 25].

Algorithms have also been developed to combine free energy minimization with comparative sequence analysis [26, 27, 28, 29, 30, 31, 32]. The advantages of these programs are the improved accuracy of secondary structure prediction and automation of the laborious process of comparative sequence analysis.

Many of the algorithms that employ free energy minimization as a tool for comparative sequence analysis require a fixed sequence alignment as input [26, 27, 28, 31, 32]. Alignments determined by sequence matching, however, are complicated by compensating base changes and the fact that most RNAs are composed of only four different nucleotides. The fixed alignment can be flawed and so may restrict the algorithms’ ability to find a conserved structure.

Algorithms that use free energy minimization to find a conserved structure without assuming a fixed alignment are more robust [30, 33], although they are generally more time consuming. Notredame et al. [29] wrote a program that uses a genetic algorithm to find the structure of a sequence given a second, related sequence with known structure. Chen et al. [30] developed a genetic algorithm that finds a conserved structure for a set of sequences without requiring a known structure.

Eddy and Durbin [34] developed an approach to automate comparative sequence analysis that is not based on free energy minimization. They developed a covariance model that takes a set of unaligned RNA sequences and determines a sequence alignment and consensus structure with multiple rounds of refinement [34].

Sankoff [35] proposed that a dynamic programming algorithm could simultaneously solve the sequence alignment and folding problems for multiple sequences. Gorodkin et al. [33] wrote the first practical algorithm of this type, FOLDALIGN, by utilizing three simplifications to speed the calculation. Firstly, the dynamic programming calculation is limited to predicting the structures for two sequences at a time. Secondly, the algorithm optimizes the number of base-pairs in the structures, rather than the free energies. Thirdly, multibranch loops are not allowed.
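To make the distinction between optimizing the number of base-pairs and optimizing free energy concrete, the sketch below maximizes the base-pair count for a single sequence with a Nussinov-style dynamic programming recursion. This is a swapped-in, simpler technique and not FOLDALIGN: it folds one sequence rather than aligning two, and its recursion does permit branched structures. The function name `max_base_pairs` and the minimum hairpin loop size of three nucleotides are assumptions made for illustration.

```python
# Watson-Crick and G-U wobble pairs.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Return the maximum number of nested canonical pairs in seq.

    min_loop is the assumed minimum number of unpaired nucleotides in a hairpin.
    """
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the fragment seq[i..j]; short fragments stay 0.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: nucleotide j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if (seq[k], seq[j]) in CANONICAL:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAAUCC"))
```

Counting pairs treats every pair as equally favorable, whereas a free energy model weights each stack and loop differently; that difference is what the nearest-neighbor rules used by Dynalign are meant to capture.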

Here, a dynamic programming algorithm, called Dynalign, is presented that aligns two sequences and finds a common structure, including multibranch loops. Dynalign is based on the dynamic programming solution proposed by Sankoff [35] and uses nearest-neighbor rules for predicting the free energies of secondary structures [20, 36, 37]. When tested with tRNA, 5 S rRNA, and R2 3′ UTR RNAs, Dynalign improves the accuracy of secondary structure prediction relative to prediction for a single sequence by free energy minimization.

Algorithm

Dynalign is a dynamic programming algorithm that takes two sequences as input and then outputs a sequence alignment and a common structure for the two sequences. The sequence alignment indicates which nucleotides are aligned in paired regions, but it does not precisely align the nucleotides in unpaired regions. For the common structure, base-pairs are allowed only if both sequences can accommodate a canonical pair at the same position in the alignment. Dynalign minimizes the total free energy of the

Discussion

Determining RNA secondary structure is important for revealing structure-function relationships and designing oligonucleotides for antisense applications and gene chip arrays by identifying targetable regions and suggesting possible confounding structures [45, 46, 47, 48]. The Dynalign algorithm takes advantage of both free energy minimization and comparative sequence analysis to predict RNA secondary structure. It can improve the accuracy of secondary structure prediction compared to standard

Dynalign algorithm

Dynalign is a four-dimensional dynamic programming algorithm and as such the calculation is divided into two steps. The fill step calculates three arrays of free energies, W(i,j,k,l), V(i,j,k,l), and W5(i,k). W(i,j,k,l) is the sum of the minimum free energies for nucleotide fragments i to j from the first sequence and k to l from the second sequence with i aligned to k and j aligned to l plus any gap penalties for interior nucleotides in the sequence alignment. V(i,j,k,l) is defined the same as

Acknowledgements

This work was supported by NIH grant GM22939. D.H.M. is a trainee in the medical scientist training program, NIH grant 5T32 GM07356.

References (59)

  • R. Lück et al. Thermodynamic prediction of conserved secondary structure: application to the RRE element of HIV, the tRNA-like element of CMV and the mRNA of prion protein. J. Mol. Biol. (1996)
  • S.B. Needleman et al. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. (1970)
  • G. Knapp. Enzymatic approaches to probing RNA secondary and tertiary structure. Methods Enzymol. (1989)
  • M. Zuker. Suboptimal sequence alignment in molecular biology. Alignment with error analysis. J. Mol. Biol. (1991)
  • T.F. Smith et al. Comparison of bio-sequences. Advan. Appl. Math. (1981)
  • N.R. Pace et al. Probing RNA structure, function, and history by comparative analysis
  • J.H. Cate et al. Crystal structure of a group I ribozyme domain: principles of RNA packing. Science (1996)
  • N. Ban et al. The complete atomic structure of the large ribosomal subunit at 2.4 Å resolution. Science (2000)
  • B.T. Wimberly et al. Structure of the 30S ribosomal subunit. Nature (2000)
  • S.H. Kim et al. Three dimensional tertiary structure of yeast phenylalanine transfer RNA. Science (1974)
  • J.D. Robertus et al. Structure of yeast phenylalanine tRNA at 3 Å resolution. Nature (1974)
  • M.M. Yusupov et al. Crystal structure of the ribosome at 5.5 Å resolution. Science (2001)
  • R.R. Gutell. Collection of small subunit (16 S- and 16 S-like) ribosomal RNA structures. Nucl. Acids Res. (1994)
  • M. Szymanski et al. 5 S rRNA data bank. Nucl. Acids Res. (1998)
  • S.H. Damberger et al. A comparative database of group I intron structures. Nucl. Acids Res. (1994)
  • J.W. Brown. The ribonuclease P database. Nucl. Acids Res. (1998)
  • N. Larsen et al. The signal recognition particle database (SRPDB). Nucl. Acids Res. (1998)
  • M. Sprinzl et al. Compilation of tRNA sequences and sequences of tRNA genes. Nucl. Acids Res. (1998)
  • C. Zwieb et al. tmRDB (tmRNA database). Nucl. Acids Res. (2000)

1 Edited by I. Tinoco