De novo prediction of structured RNAs from genomic sequences

doi:10.1016/j.tibtech.2009.09.006

Trends in Biotechnology

Volume 28, Issue 1, January 2010, Pages 9-19

https://doi.org/10.1016/j.tibtech.2009.09.006

Growing recognition of the numerous, diverse and important roles played by non-coding RNA in all organisms motivates better elucidation of these cellular components. Comparative genomics is a powerful tool for this task and is arguably preferable to any high-throughput experimental technology currently available, because evolutionary conservation highlights functionally important regions. Conserved secondary structure, rather than primary sequence, is the hallmark of many functionally important RNAs, because compensatory substitutions in base-paired regions preserve structure. Unfortunately, such substitutions also obscure sequence identity and confound alignment algorithms, which complicates analysis greatly. This paper surveys recent computational advances in this difficult arena, which have enabled genome-scale prediction of cross-species conserved RNA elements. These predictions suggest that a wealth of these elements indeed exist.

Introduction

Non-coding RNAs (ncRNAs) are functional transcripts that do not encode proteins. A handful of examples, such as transfer and ribosomal RNAs, have been well known since the dawn of molecular biology, and probably have existed since the dawn of life. These few examples have critical and deeply central roles, but ironically, elucidation of the full spectrum of ncRNA activity has received relatively little attention. Within the past 10–15 years, however, several striking discoveries, including RNA interference, microRNAs and riboswitches, have demonstrated that RNAs have unexpectedly diverse, sophisticated and important roles in all living organisms, which has sparked renewed interest in the “modern RNA world” [1]. To hint at the scope of the issue, only 1.2% of the human genome encodes protein [2], but recent data suggest that 90% of the genome is transcribed on one or both strands, to some extent, at some time, in some tissues [3]. The significance of this observation remains unclear and controversial, but a growing body of evidence points to the presence of many functionally important non-coding transcripts [4]. Even if the bulk of the non-coding transcription is noise, this still provides a vast substrate upon which natural selection could have acted to generate a multitude of biologically important ncRNAs. Thus, even striking ncRNA-related discoveries, such as microRNAs, may only be the tip of the iceberg.

Interest in conducting computational screens for functional RNAs has risen with their increased prominence [5]. However, in contrast to protein coding genes, whose regular codon structure provides strong signals within nucleotide sequences, the signals for ncRNAs are subtler. For example, many ncRNAs are believed to be processed out of longer primary transcripts, including introns of protein-coding genes, and hence, lack features such as proximal promoters in addition to codon structure [1]. There is, however, one general characteristic shared by many (but not all) known ncRNAs: they fold into complex shapes that are crucial to function and are thus conserved. Although prediction of RNA 3D structure is less well developed than protein structure prediction, the prediction of RNA secondary structure (the set of intramolecular, largely Watson–Crick, base pairs that define the fundamental units from which the tertiary structure is formed) is reasonably well understood and computationally tractable. Furthermore, secondary structure also tends to be conserved, whereas primary sequences often evolve more rapidly (Figure 1), and are of limited use except when searching for close homologs. These facts make secondary structure the key signal to be exploited in ncRNA prediction. Although secondary structure (henceforth, simply structure) prediction is tractable, it is not trivial (e.g. involving interactions between nucleotides at variable and sometimes large distances in the primary sequence), which makes the problem challenging intellectually and computationally [5].

Most RNA molecules will fold into some secondary structure, but given the importance of secondary structure in functional ncRNAs, it is reasonable to ask whether they have more stable structures than random genomic sequences. A negative answer to this question was provided about a decade ago. That is, the folding energy of RNA genes is not distinguished easily from that of appropriately chosen background sequences, such as randomly shuffled versions of the RNA gene itself [6,7]. Therefore, a simple screening approach that seeks unusually stable genomic segments will generally not work. MicroRNAs, whose precursors form unusually stable stem-loops, are a notable exception [8,9].
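The shuffling comparison described above can be sketched in a few lines. The following is only an illustration under loud assumptions: it uses a Nussinov-style count of nested base pairs as a toy stand-in for a thermodynamic folding energy, and a plain mononucleotide shuffle as the background, whereas real screens use energy models and more carefully chosen backgrounds (for example, dinucleotide-preserving shuffles). The names `max_base_pairs` and `shuffle_z_score` are hypothetical. Note that the sign is reversed relative to a free-energy z-score: here a larger pair count means a more stable structure.

```python
import random
import statistics

# Watson-Crick pairs plus G-U wobble pairs (toy pairing rules).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov-style dynamic programming).

    A toy stand-in for folding energy, NOT a thermodynamic model.
    min_loop is the minimum number of unpaired bases enclosed by a pair.
    """
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


def shuffle_z_score(seq, n_shuffles=100, seed=0):
    """Z-score of the native score against shuffled versions of the same sequence."""
    rng = random.Random(seed)
    native = max_base_pairs(seq)
    letters = list(seq)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # mononucleotide shuffle (a simplifying assumption)
        background.append(max_base_pairs("".join(letters)))
    sd = statistics.pstdev(background)
    if sd == 0:
        return 0.0  # no variation in the background, so no usable signal
    return (native - statistics.mean(background)) / sd


if __name__ == "__main__":
    print(max_base_pairs("GGGAAACCC"))  # 3 (three G-C pairs around an AAA loop)
    print(shuffle_z_score("GGGAAACCC"))
```

The observation cited in the text is that, for most RNA genes, such a score does not stand out clearly from its shuffled background, which is why single-sequence screens of this kind generally fail.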

Although searching for possible structured elements in one nucleotide sequence is generally ineffective, as discussed above, searching in multiple (orthologous) sequences can be highly effective, since evolutionary conservation highlights functionally important regions of all kinds. Importantly, such searches leverage the rapidly increasing body of comparative genomic sequence data. A key issue, however, is that the evolutionary signature of an RNA gene is quite different from that of a protein-coding gene. In particular, as noted earlier, the nucleotide sequence of an ncRNA might evolve relatively rapidly, which makes identification and alignment of orthologous sequences difficult or impossible, especially when using tools that focus only on sequences. However, patterns of compensating base changes, for example, an A–U base pair in a human RNA sequence that corresponds to a C–G pair in mice, can provide evidence to support the existence of a conserved RNA structure (without requiring conserved sequence; Figure 1) and insight into the structure itself. Indeed, short of X-ray crystallography, this type of comparative analysis, done carefully by human experts [10], has been the gold standard for RNA secondary structure prediction for more than 40 years. In a nutshell, this highlights the key challenges in computational prediction of ncRNAs: to find orthologous regions, expose the common structure therein, and do so rapidly and accurately.
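The idea of a compensating base change can be made concrete with a small sketch. It assumes an already aligned set of uppercase RNA sequences of equal length and a candidate pair of alignment columns; the function name `covariation_support` and the example sequences are hypothetical, and gap characters are simply counted as non-pairing. This is an illustration of the signal, not the scoring scheme of any particular tool.

```python
# Canonical pairs (Watson-Crick plus G-U wobble), written as two-letter strings.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def covariation_support(alignment, i, j):
    """Summarize evidence that alignment columns i and j form a conserved base pair.

    alignment: non-empty list of equal-length, uppercase RNA strings (one per species).
    """
    pairs = [row[i] + row[j] for row in alignment]
    pairing = [p for p in pairs if p in CANONICAL]
    distinct = set(pairing)
    return {
        # Fraction of sequences in which the two columns can pair at all.
        "fraction_paired": len(pairing) / len(pairs),
        "distinct_pair_types": sorted(distinct),
        # Compensatory change: two pairing sequences differ at BOTH positions,
        # e.g. A-U in one species and C-G in another.
        "compensatory": any(
            a[0] != b[0] and a[1] != b[1] for a in distinct for b in distinct
        ),
    }


if __name__ == "__main__":
    # Toy alignment: a 3-bp stem (columns 0-9, 1-8, 2-7) closing a GAAA loop.
    aln = [
        "GGAGAAAUCC",  # "human": columns 2 and 7 form A-U
        "GGCGAAAGCC",  # "mouse": columns 2 and 7 form C-G
    ]
    print(covariation_support(aln, 2, 7))
    print(covariation_support(aln, 0, 9))
```

In the toy alignment, columns 2 and 7 pair in both sequences despite differing at both positions, which supports a conserved pair; columns 0 and 9 also pair but carry no covariation evidence because the sequence is identical.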

The comparative approach is important for an additional reason. Over the next few years, we expect that emerging technology such as high-throughput sequencing of RNA (RNAseq [11]) will reveal the transcriptomes of many organisms with unprecedented depth and precision. Yet, given the extensive breadth of genomic transcription now observed, at least in mammals, evidence of transcription can no longer be taken as proof of functional importance. Any observed transcript might just be noise or incompletely degraded detritus that arises from the expression of some nearby, functionally important RNA. Furthermore, experimental protocols will remain limited with regard to the diversity of species, cell types, states, growth and stress conditions that are probed. Consequently, lack of measured expression of a given genomic segment is not proof of lack of function. By contrast, evolutionary conservation strongly suggests functional importance, whether or not expression has already been verified experimentally. This does not deny the value of experimental evidence, of course, nor the existence of functionally important species-specific ncRNAs, but merely argues that detection of evolutionarily conserved ncRNAs by comparative genomics is a powerful tool, and belongs in any effort to understand living systems.

Computational search for conserved RNA structure does have certain important limitations. We highlight two of them here, because they color much of what follows. First, these searches are expensive computationally, principally because of the nature of the underlying RNA folding algorithms that need to be applied to these multiple sequences [12,13]. Even single-sequence folding algorithms have run times that grow as the cube of the sequence length. Applied naively, then, whatever time it takes to screen a 1 kb sequence, screening 10 kb will take 1000 times longer and screening 100 kb will take 1 million times longer. Hence, all successful programs in this arena are engineered carefully to control run time, which entails some, hopefully modest, loss of accuracy on long genomic sequences. For example, one simple, widely used strategy is the “sliding window” approach, wherein the genomic sequence is cut into multiple, overlapping, fixed-length segments (windows) that are processed separately. This obviously limits the cubic run time penalty to the length of the window, but unfortunately, also limits the lengths of discoverable structures and risks arbitrarily truncating them. Even using substantially more sophisticated techniques, genome-scale ncRNA analyses often consume tens to hundreds of computer years. These high computational costs are one reason why ncRNA gene finding is still in its infancy.
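The cubic scaling and the sliding-window workaround can be illustrated with a short sketch. The window length and step used below are arbitrary illustrative values, not parameters taken from any published screen, and the function names `sliding_windows` and `relative_cost` are hypothetical. Costs are in arbitrary units, assuming only that folding a segment of length n costs n³.

```python
def sliding_windows(seq_len, window, step):
    """Yield (start, end) half-open coordinates of overlapping fixed-length windows."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if seq_len <= window:
        yield (0, seq_len)
        return
    start = 0
    while start + window < seq_len:
        yield (start, start + window)
        start += step
    # Anchor the final window to the sequence end so that no tail is dropped.
    yield (seq_len - window, seq_len)


def relative_cost(seq_len, window=None, step=None):
    """Cubic-cost estimate (arbitrary units) for whole-sequence or windowed folding."""
    if window is None:
        return seq_len ** 3
    return sum((end - start) ** 3 for start, end in sliding_windows(seq_len, window, step))


if __name__ == "__main__":
    # Naive scaling relative to a 1 kb sequence.
    print(relative_cost(10_000) / relative_cost(1_000))   # 1000.0
    print(relative_cost(100_000) / relative_cost(1_000))  # 1000000.0
    # Windowed cost for 100 kb with assumed 200-nt windows and a 100-nt step.
    print(relative_cost(100_000, window=200, step=100) / relative_cost(100_000))
```

With windows, the total cost grows only linearly with genome length, at the price noted in the text: no structure longer than the window can be found, and a structure that straddles a window boundary may be truncated. Overlapping windows reduce, but do not eliminate, the second problem.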

The second significant limitation of these general searches for conserved RNA secondary structures is more conceptual. It is natural to want to think of each element discovered as an ncRNA gene, but the truth is more complex because the approaches described here might generate only a partial picture for each ncRNA. For example, technical limitations related to window boundaries or splicing could result in partial or fragmentary predictions. More intrinsically, some ncRNAs lack conserved secondary structures, or may have only patches of conserved structure embedded in longer, largely unstructured transcripts. Additionally, conserved, functionally important RNA structures, such as the selenocysteine insertion sequence (SECIS element), are known to exist in mRNAs, usually in their untranslated regions. Therefore, identification of RNA structures in genomic data should trigger post-processing steps and follow-up experiments to characterize transcript boundaries (and function) more precisely. For reasons of simplicity, however, we will refer to individual conserved structures as ncRNA genes.

This review focuses explicitly on computational prediction of ncRNA elements by comparative genomics, that is, the discovery of conserved structured elements in multiple genomic sequences. Other methods for de novo prediction have succeeded in some contexts (e.g. exploiting organism-specific differences in mono- or dinucleotide frequencies of ncRNAs versus background [14,15,16]), but the comparative approach appears to be the most broadly applicable. We will say little about the related ncRNA homology search problem (finding new instances of a particular RNA family given one or more examples) but this equally important task comes with its own set of issues [17], especially the difficulty of finding homologs outside the phylogenetic range of known examples.

Section snippets

From RNA folding to gene finding

Even though RNA structure cannot be detected reliably by merely folding single sequences, the principles obtained from folding single sequences are fundamental and often constitute an implicit part of more elaborate methods. For example, to date, no large-scale RNA structure screens have accounted for so-called pseudoknots, because the underlying RNA folding algorithms do not do that. Without pseudoknots, RNA secondary structure can be represented by nested parentheses. For example, the hairpin

In silico screening for RNA structures

Prediction of RNA structure in genomic sequences is related closely to the existing methods for RNA structure prediction. As indicated above, multiple sequences are needed to predict RNA structures reliably. Several strategies exist for RNA structure prediction based on multiple sequences [25], which can be categorized loosely as “align-first,” “fold-first” and “joint.” As the name suggests, align-first strategies start by aligning all sequences using standard multiple sequence alignment tools,

Conclusions and perspectives

To date, annotation in the genome databases relates almost exclusively to protein coding genes. By contrast, the computational screens for ncRNAs described here suggest that a large number of functional ncRNAs remain to be found. This is consistent with the observation that most of the non-coding mammalian genome is transcribed. Computational de novo discovery of ncRNAs within genomic sequences is still in its infancy. Nevertheless, these screens provide a valuable starting point for subsequent

Note added in proof

See Kavanaugh [84] for a recent success with the single-sequence folding energy approach, as well as some investigation of the subtleties inherent in sliding window approaches.

Acknowledgements

We thank the anonymous referees for numerous helpful suggestions. This work was supported in part by the Danish Research Council for Technology and Production, The Lundbeck Foundation and Danish Center for Scientific Computation, NIEHS Grant P30ES07033, and Austrian GEN-AU project “Regulatory ncRNAs.”

References (84)

S.R. Eddy
Computational genomics of noncoding RNA genes
Cell
(2002)
S.F. Altschul
Basic local alignment search tool
J. Mol. Biol.
(1990)
E. Rivas
Computational identification of noncoding RNAs in E. coli by comparative genomics
Curr. Biol.
(2001)
S. Washietl et al.
Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics
J. Mol. Biol.
(2004)
D.H. Mathews et al.
Dynalign: an algorithm for finding the secondary structure common to two RNA sequences
J. Mol. Biol.
(2002)
J.D. Kohtz et al.
Developmental regulation of EVF-1, a novel non-coding RNA transcribed upstream of the mouse Dlx6 gene
Gene Expr. Patterns
(2004)
S.R. Eddy
Non-coding RNA genes and the modern RNA world
Nat. Rev. Genet.
(2001)
International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature...
ENCODE Project Consortium (2004) The ENCODE (ENCyclopedia Of DNA Elements) Project. Science, 306,...
J.S. Mattick
The genetic signatures of noncoding RNAs
PLoS Genetics
(2009)

E. Rivas et al.
Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs
Bioinformatics
(2000)
C.T. Workman et al.
No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution
Nucleic Acids Res.
(1999)
Bonnet, E. et al. (2004). Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free...
M. Lindow et al.
Principles and limitations of computational microRNA gene and target finding
DNA Cell Biol.
(2007)
N.R. Pace
Probing RNA structure, function, and history by comparative analysis
The RNA World
(1999)
N. Cloonan et al.
Transcriptome content and dynamics at single-nucleotide resolution
Genome Biol.
(2008)
R. Nussinov et al.
Fast algorithm for predicting the secondary structure of single-stranded RNA
Proc. Natl. Acad. Sci. U. S. A.
(1980)
S.R. Eddy
How do RNA folding algorithms work?
Nat. Biotechnol.
(2004)
Klein, R.J. et al. (2002) Noncoding RNA genes identified in AT-rich hyperthermophiles. Proc. Natl. Acad. Sci. U. S. A....
P. Schattner
Searching for RNA genes using base-composition statistics
Nucleic Acids Res.
(2002)
P. Larsson
De novo search for non-coding RNA genes in the AT-rich genome of Dictyostelium discoideum: performance of Markov-dependent genome feature scoring
Genome Res.
(2008)
P. Menzel
The tedious task of finding homologous non-coding RNA genes
RNA
(2009)
M. Zuker
Algorithms and thermodynamics for RNA secondary structure prediction: A practical guide
I.L. Hofacker
Fast folding and comparison of RNA secondary structures (The Vienna RNA Package)
Monatshefte für Chemie (Chemical Monthly)
(1994)
R. Durbin
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
(1998)
R.D. Dowell et al.
Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction
BMC Bioinformatics
(2004)
S.R. Eddy et al.
RNA sequence analysis using covariance models
Nucleic Acids Res.
(1994)
Y. Sakakibara
Stochastic context-free grammars for tRNA modeling
Nucleic Acids Res.
(1994)
M. Andronescu
Efficient parameter estimation for RNA secondary structure prediction
Bioinformatics
(2007)
P.P. Gardner et al.
A comprehensive comparison of comparative RNA structure prediction approaches
BMC Bioinformatics
(2004)
D.D. Sankoff
Simultaneous solution of the RNA folding, alignment and protosequence problems
SIAM J. Appl. Math.
(1985)
Seemann, S. et al. (2008) Unifying evolutionary and thermodynamic information for RNA folding of multiple...
S. Janssen
Shape based indexing for faster search of RNA family databases
BMC Bioinformatics
(2008)
J. Reeder
Locomotif: from graphical motif description to RNA motif search
Bioinformatics
(2007)
S.M. Marquez
Structural implications of novel diversity in eucaryal RNase P RNA
RNA
(2005)
E.P. Nawrocki
Infernal 1.0: inference of RNA alignments
Bioinformatics
(2009)
S. Will
Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering
PLoS Comput. Biol.
(2007)
E. Rivas et al.
Noncoding RNA gene detection using comparative sequence analysis
BMC Bioinformatics
(2001)
J.P. McCutcheon et al.
Computational identification of non-coding RNAs in Saccharomyces cerevisiae by comparative genomics
Nucleic Acids Res.
(2003)
P. Clote
Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency
RNA
(2005)
S. Washietl
Fast and reliable prediction of noncoding RNAs
Proc. Natl. Acad. Sci. U. S. A.
(2005)
S. Washietl
Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome
Nat. Biotechnol.
(2005)
