Review
Feature Review
De novo prediction of structured RNAs from genomic sequences

https://doi.org/10.1016/j.tibtech.2009.09.006

Growing recognition of the numerous, diverse and important roles played by non-coding RNA in all organisms motivates better elucidation of these cellular components. Comparative genomics is a powerful tool for this task and is arguably preferable to any high-throughput experimental technology currently available, because evolutionary conservation highlights functionally important regions. Conserved secondary structure, rather than primary sequence, is the hallmark of many functionally important RNAs, because compensatory substitutions in base-paired regions preserve structure. Unfortunately, such substitutions also obscure sequence identity and confound alignment algorithms, which complicates analysis greatly. This paper surveys recent computational advances in this difficult arena, which have enabled genome-scale prediction of cross-species conserved RNA elements. These predictions suggest that a wealth of these elements indeed exist.

Introduction

Non-coding RNAs (ncRNAs) are functional transcripts that do not encode proteins. A handful of examples, such as transfer and ribosomal RNAs, have been well known since the dawn of molecular biology, and probably have existed since the dawn of life. These few examples have critical and deeply central roles, but ironically, elucidation of the full spectrum of ncRNA activity has received relatively little attention. Within the past 10–15 years, however, several striking discoveries, including RNA interference, microRNAs and riboswitches, have demonstrated that RNAs have unexpectedly diverse, sophisticated and important roles in all living organisms, which has sparked renewed interest in the “modern RNA world” [1]. To hint at the scope of the issue, only 1.2% of the human genome encodes protein [2], but recent data suggest that 90% of the genome is transcribed on one or both strands, to some extent, at some time, in some tissues [3]. The significance of this observation remains unclear and controversial, but a growing body of evidence points to the presence of many functionally important non-coding transcripts [4]. Even if the bulk of the non-coding transcription is noise, this still provides a vast substrate upon which natural selection could have acted to generate a multitude of biologically important ncRNAs. Thus, even striking ncRNA-related discoveries, such as microRNAs, may only be the tip of the iceberg.

Interest in conducting computational screens for functional RNAs has risen with their increased prominence [5]. However, in contrast to protein coding genes, whose regular codon structure provides strong signals within nucleotide sequences, the signals for ncRNAs are subtler. For example, many ncRNAs are believed to be processed out of longer primary transcripts, including introns of protein-coding genes, and hence, lack features such as proximal promoters in addition to codon structure [1]. There is, however, one general characteristic shared by many (but not all) known ncRNAs: they fold into complex shapes that are crucial to function and are thus conserved. Although prediction of RNA 3D structure is less well developed than protein structure prediction, the prediction of RNA secondary structure (the set of intramolecular, largely Watson–Crick, base pairs that define the fundamental units from which the tertiary structure is formed) is reasonably well understood and computationally tractable. Furthermore, secondary structure also tends to be conserved, whereas primary sequences often evolve more rapidly (Figure 1), and are of limited use except when searching for close homologs. These facts make secondary structure the key signal to be exploited in ncRNA prediction. Although secondary structure (henceforth, simply structure) prediction is tractable, it is not trivial (e.g. involving interactions between nucleotides at variable and sometimes large distances in the primary sequence), which makes the problem challenging intellectually and computationally [5].

Most RNA molecules will fold into some secondary structure, but given the importance of secondary structure in functional ncRNAs, it is reasonable to ask whether they have more stable structures than random genomic sequences. A negative answer to this question was provided about a decade ago: the folding energy of RNA genes is not easily distinguished from that of appropriately chosen background sequences, such as randomly shuffled versions of the RNA gene itself [6,7]. Therefore, a simple screening approach that seeks unusually stable genomic segments will generally not work. MicroRNAs, whose precursors form unusually stable stem-loops, are a notable exception [8,9].
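The comparison against shuffled background sequences can be illustrated with a minimal Python sketch. It is not the method of the cited studies: as a stand-in for a thermodynamic folding energy it uses the maximum number of nested base pairs (a Nussinov-style count), and it uses a plain mononucleotide shuffle as the background. The function names `max_pairs` and `stability_zscore`, the minimum loop length of 3 and the number of shuffles are all assumptions made for illustration. Because the score counts pairs rather than free energy, a positive z-score here means "more structured than the shuffled controls".

```python
import random
import statistics

# Watson-Crick pairs plus the G-U wobble pair (assumed pairing rules).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs: a toy proxy for folding stability."""
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: position j is unpaired
            for k in range(i, j - min_loop):  # case: k pairs with j
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


def stability_zscore(seq, n_shuffles=200, seed=0):
    """Z-score of the native pair count against mononucleotide-shuffled controls."""
    rng = random.Random(seed)
    native = max_pairs(seq)
    letters = list(seq)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves base composition only
        scores.append(max_pairs("".join(letters)))
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    return (native - mean) / sd if sd > 0 else 0.0


if __name__ == "__main__":
    print(max_pairs("GGGAAACCC"))
    print(stability_zscore("GGGGGGAAAACCCCCC", n_shuffles=50, seed=1))
```

A realistic screen would replace `max_pairs` with a thermodynamic folding program and would typically use a shuffle that preserves dinucleotide frequencies, in line with the background models discussed above.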

Although searching for possible structured elements in one nucleotide sequence is generally ineffective, as discussed above, searching in multiple (orthologous) sequences can be highly effective, since evolutionary conservation highlights functionally important regions of all kinds. Importantly, such searches leverage the rapidly increasing body of comparative genomic sequence data. A key issue, however, is that the evolutionary signature of an RNA gene is quite different from that of a protein-coding gene. In particular, as noted earlier, the nucleotide sequence of an ncRNA might evolve relatively rapidly, which makes identification and alignment of orthologous sequences difficult or impossible, especially when using tools that focus only on sequences. However, patterns of compensating base changes, for example, an A–U base pair in a human RNA sequence that corresponds to a C–G pair in mice, can provide evidence to support the existence of a conserved RNA structure (without requiring conserved sequence; Figure 1) and insight into the structure itself. Indeed, short of X-ray crystallography, this type of comparative analysis, done carefully by human experts [10], has been the gold standard for RNA secondary structure prediction for more than 40 years. In a nutshell, this highlights the key challenges in computational prediction of ncRNAs: to find orthologous regions, expose the common structure therein, and do so rapidly and accurately.
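The idea of compensating base changes can be made concrete with a small Python sketch. It assumes an already aligned, gap-free set of sequences and a proposed list of paired columns; the function name `covariation_support`, the toy sequences and the decision rule (two canonical pair types that differ at both positions) are illustrative assumptions, not a published scoring scheme.

```python
# Canonical pairs, including G-U wobble (assumed pairing rules).
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def covariation_support(alignment, pairs):
    """Summarize evidence for compensating changes at each proposed column pair.

    alignment: list of equal-length RNA strings (rows are species).
    pairs: list of 0-based (i, j) column pairs proposed to base-pair.
    """
    report = {}
    for i, j in pairs:
        observed = [row[i] + row[j] for row in alignment]
        canonical = [p for p in observed if p in CANONICAL]
        kinds = set(canonical)
        # Compensatory change: two canonical pair types differing at BOTH positions,
        # e.g. A-U in one species and C-G in another.
        compensated = any(
            a[0] != b[0] and a[1] != b[1] for a in kinds for b in kinds
        )
        report[(i, j)] = {
            "pair_types": sorted(kinds),
            "fraction_canonical": len(canonical) / len(observed),
            "compensatory": compensated,
        }
    return report


if __name__ == "__main__":
    # Hypothetical 10-nt hairpins: columns 0-2 pair with columns 9-7.
    toy_alignment = [
        "GACGAAAGUC",  # "human": column pair (1, 8) is A-U
        "GCCGAAAGGC",  # "mouse": column pair (1, 8) is C-G
    ]
    for cols, info in covariation_support(toy_alignment, [(0, 9), (1, 8), (2, 7)]).items():
        print(cols, info)
```

In this toy input, only the middle pair of the stem changes at both positions while remaining able to pair, which is the kind of signal that supports a conserved structure without requiring conserved sequence.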

The comparative approach is important for an additional reason. Over the next few years, we expect that emerging technology such as high-throughput sequencing of RNA (RNAseq [11]) will reveal the transcriptomes of many organisms with unprecedented depth and precision. Yet, given the extensive breadth of genomic transcription now observed, at least in mammals, evidence of transcription can no longer be taken as proof of functional importance. Any observed transcript might just be noise or incompletely degraded detritus that arises from the expression of some nearby, functionally important RNA. Furthermore, experimental protocols will remain limited with regard to the diversity of species, cell types, states, growth and stress conditions that are probed. Consequently, lack of measured expression of a given genomic segment is not proof of lack of function. By contrast, evolutionary conservation strongly suggests functional importance, whether or not expression has already been verified experimentally. This does not deny the value of experimental evidence, of course, nor the existence of functionally important species-specific ncRNAs, but merely argues that detection of evolutionarily conserved ncRNAs by comparative genomics is a powerful tool, and belongs in any effort to understand living systems.

Computational search for conserved RNA structure does have certain important limitations. We highlight two of them here, because they color much of what follows. First, these searches are expensive computationally, principally because of the nature of the underlying RNA folding algorithms that need to be applied to these multiple sequences [12,13]. Even single-sequence folding algorithms have run times that grow as the cube of the sequence length. Applied naively, whatever time it takes to screen a 1 kb sequence, screening will take 1000 times longer for 10 kb and 1 million times longer for 100 kb. Hence, all successful programs in this arena are engineered carefully to control run time, which entails some, hopefully modest, loss of accuracy on long genomic sequences. For example, one simple, widely used strategy is the “sliding window” approach, wherein the genomic sequence is cut into multiple, overlapping, fixed-length segments (windows) that are processed separately. This obviously limits the cubic run time penalty to the length of the window, but unfortunately, also limits the lengths of discoverable structures and risks arbitrarily truncating them. Even using substantially more sophisticated techniques, genome-scale ncRNA analyses often consume tens to hundreds of computer years. These high computational costs are one reason why ncRNA gene finding is still in its infancy.
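The sliding window strategy and its effect on the cubic cost can be sketched in a few lines of Python. The function names `sliding_windows` and `relative_cost`, and the window and step sizes used in the example (200 nt windows moved in 100 nt steps), are assumptions chosen for illustration; actual screens choose these values to suit the structures they target.

```python
def sliding_windows(seq_len, window, step):
    """Return half-open (start, end) intervals that cover [0, seq_len)."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if seq_len <= window:
        return [(0, seq_len)]
    starts = list(range(0, seq_len - window + 1, step))
    if starts[-1] + window < seq_len:
        starts.append(seq_len - window)  # last window sits flush with the end
    return [(s, s + window) for s in starts]


def relative_cost(seq_len, window, step):
    """Cubic cost of folding every window, relative to folding the whole sequence."""
    n_windows = len(sliding_windows(seq_len, window, step))
    return n_windows * window ** 3 / seq_len ** 3


if __name__ == "__main__":
    print(sliding_windows(10, 4, 2))
    # Hypothetical 100 kb region, 200 nt windows, 100 nt step.
    print(len(sliding_windows(100_000, 200, 100)))
    print(relative_cost(100_000, 200, 100))
```

The total cost grows only linearly with the number of windows, but any structure longer than `window`, or one that straddles a window boundary without fitting inside an overlapping neighbor, cannot be recovered intact.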

The second significant limitation of these general searches for conserved RNA secondary structures is more conceptual. It is natural to want to think of each element discovered as an ncRNA gene, but the truth is more complex because the approaches described here might generate only a partial picture for each ncRNA. For example, technical limitations related to window boundaries or splicing could result in partial or fragmentary predictions. More intrinsically, some ncRNAs lack conserved secondary structures, or may have only patches of conserved structure embedded in longer, largely unstructured transcripts. Additionally, conserved, functionally important RNA structures, such as the selenocysteine insertion sequence (SECIS element), are known to exist in mRNAs, usually in their untranslated regions. Therefore, identification of RNA structures in genomic data should trigger post-processing steps and follow-up experiments to characterize transcript boundaries (and function) more precisely. For reasons of simplicity, however, we will refer to individual conserved structures as ncRNA genes.

This review focuses explicitly on computational prediction of ncRNA elements by comparative genomics, that is, the discovery of conserved structured elements in multiple genomic sequences. Other methods for de novo prediction have succeeded in some contexts (e.g. exploiting organism-specific differences in mono- or dinucleotide frequencies of ncRNAs versus background [14–16]), but the comparative approach appears to be the most broadly applicable. We will say little about the related ncRNA homology search problem (finding new instances of a particular RNA family given one or more examples) but this equally important task comes with its own set of issues [17], especially the difficulty of finding homologs outside the phylogenetic range of known examples.


From RNA folding to gene finding

Even though RNA structure cannot be detected reliably by merely folding single sequences, the principles obtained from folding single sequences are fundamental and often constitute an implicit part of more elaborate methods. For example, to date, no large-scale RNA structure screens have accounted for so-called pseudoknots, because the underlying RNA folding algorithms do not do that. Without pseudoknots, RNA secondary structure can be represented by nested parentheses. For example, the hairpin
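As an illustration of the nested-parentheses notation, the following minimal Python sketch converts such a string into a list of base pairs. The dot used for unpaired positions follows the common dot-bracket convention and is an assumption here, as are the function name `parse_dot_bracket` and the example string, which is not the hairpin example from the original text.

```python
def parse_dot_bracket(structure):
    """Convert a nested-parentheses string into a sorted list of (i, j) base pairs.

    "(" opens a pair, ")" closes the most recently opened pair, "." is unpaired.
    Positions are 0-based. Pseudoknots cannot be expressed in this notation.
    """
    stack = []
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


if __name__ == "__main__":
    # Hypothetical hairpin: a 3-pair stem closing a 4-nt loop.
    print(parse_dot_bracket("(((....)))"))
```

Because every closing parenthesis matches the most recent unmatched opening one, the pairs are nested by construction, which is exactly why crossing (pseudoknotted) pairs fall outside this representation.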

In silico screening for RNA structures

Prediction of RNA structure in genomic sequences is related closely to the existing methods for RNA structure prediction. As indicated above, multiple sequences are needed to predict RNA structures reliably. Several strategies exist for RNA structure prediction based on multiple sequences [25], which can be categorized loosely as “align-first,” “fold-first” and “joint.” As the name suggests, align-first strategies start by aligning all sequences using standard multiple sequence alignment tools,

Conclusions and perspectives

To date, annotation in the genome databases relates almost exclusively to protein coding genes. By contrast, the computational screens for ncRNAs described here suggest that a large number of functional ncRNAs remain to be found. This is consistent with the observation that most of the non-coding mammalian genome is transcribed. Computational de novo discovery of ncRNAs within genomic sequences is still in its infancy. Nevertheless, these screens provide a valuable starting point for subsequent

Note added in proof

See Kavanaugh [84] for a recent success with the single-sequence folding energy approach, as well as some investigation of the subtleties inherent in sliding window approaches.

Acknowledgements

We thank the anonymous referees for numerous helpful suggestions. This work was supported in part by the Danish Research Council for Technology and Production, The Lundbeck Foundation and Danish Center for Scientific Computation, NIEHS Grant P30ES07033, and Austrian GEN-AU project “Regulatory ncRNAs.”

References (84)

  • E. Rivas et al.

    Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs

    Bioinformatics

    (2000)
  • C.T. Workman et al.

    No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution

    Nucleic Acids Res.

    (1999)
  • E. Bonnet et al.

    Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free...

    (2004)
  • M. Lindow et al.

    Principles and limitations of computational microRNA gene and target finding

    DNA Cell Biol.

    (2007)
  • N.R. Pace

    Probing RNA structure, function, and history by comparative analysis

    The RNA World

    (1999)
  • N. Cloonan et al.

    Transcriptome content and dynamics at single-nucleotide resolution

    Genome Biol.

    (2008)
  • R. Nussinov et al.

    Fast algorithm for predicting the secondary structure of single-stranded RNA

    Proc. Natl. Acad. Sci. U. S. A.

    (1980)
  • S.R. Eddy

    How do RNA folding algorithms work?

    Nat. Biotechnol.

    (2004)
  • R.J. Klein et al.

    Noncoding RNA genes identified in AT-rich hyperthermophiles

    Proc. Natl. Acad. Sci. U. S. A.

    (2002)
  • P. Schattner

    Searching for RNA genes using base-composition statistics

    Nucleic Acids Res.

    (2002)
  • P. Larsson

    De novo search for non-coding RNA genes in the AT-rich genome of Dictyostelium discoideum: performance of Markov-dependent genome feature scoring

    Genome Res.

    (2008)
  • P. Menzel

    The tedious task of finding homologous non-coding RNA genes

    RNA

    (2009)
  • M. Zuker

    Algorithms and thermodynamics for RNA secondary structure prediction: A practical guide

  • I.L. Hofacker

    Fast folding and comparison of RNA secondary structures (The Vienna RNA Package)

    Monatshefte für Chemie (Chemical Monthly)

    (1994)
  • R. Durbin

    Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids

    (1998)
  • R.D. Dowell et al.

    Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction

    BMC Bioinformatics

    (2004)
  • S.R. Eddy et al.

    RNA sequence analysis using covariance models

    Nucleic Acids Res.

    (1994)
  • Y. Sakakibara

    Stochastic context-free grammars for tRNA modeling

    Nucleic Acids Res.

    (1994)
  • M. Andronescu

    Efficient parameter estimation for RNA secondary structure prediction

    Bioinformatics

    (2007)
  • P.P. Gardner et al.

    A comprehensive comparison of comparative RNA structure prediction approaches

    BMC Bioinformatics

    (2004)
  • D.D. Sankoff

    Simultaneous solution of the RNA folding, alignment and protosequence problems

    SIAM J. Appl. Math.

    (1985)
  • S. Seemann et al.

    Unifying evolutionary and thermodynamic information for RNA folding of multiple...

    (2008)
  • S. Janssen

    Shape based indexing for faster search of RNA family databases

    BMC Bioinformatics

    (2008)
  • J. Reeder

    Locomotif: from graphical motif description to RNA motif search

    Bioinformatics

    (2007)
  • S.M. Marquez

    Structural implications of novel diversity in eucaryal RNase P RNA

    RNA

    (2005)
  • E.P. Nawrocki

    Infernal 1.0: inference of RNA alignments

    Bioinformatics

    (2009)
  • S. Will

    Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering

    PLoS Comput. Biol.

    (2007)
  • E. Rivas et al.

    Noncoding RNA gene detection using comparative sequence analysis

    BMC Bioinformatics

    (2001)
  • J.P. McCutcheon et al.

    Computational identification of non-coding RNAs in Saccharomyces cerevisiae by comparative genomics

    Nucleic Acids Res.

    (2003)
  • P. Clote

    Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency

    RNA

    (2005)
  • S. Washietl

    Fast and reliable prediction of noncoding RNAs

    Proc. Natl. Acad. Sci. U. S. A.

    (2005)
  • S. Washietl

    Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome

    Nat. Biotechnol.

    (2005)