Trends in Biotechnology
Feature Review
De novo prediction of structured RNAs from genomic sequences
Introduction
Non-coding RNAs (ncRNAs) are functional transcripts that do not encode proteins. A handful of examples, such as transfer and ribosomal RNAs, have been well known since the dawn of molecular biology, and probably have existed since the dawn of life. These few examples have critical and deeply central roles, but ironically, elucidation of the full spectrum of ncRNA activity has received relatively little attention. Within the past 10–15 years, however, several striking discoveries, including RNA interference, microRNAs and riboswitches, have demonstrated that RNAs have unexpectedly diverse, sophisticated and important roles in all living organisms, which has sparked renewed interest in the “modern RNA world” [1]. To hint at the scope of the issue, only 1.2% of the human genome encodes protein [2], but recent data suggest that 90% of the genome is transcribed on one or both strands, to some extent, at some time, in some tissues [3]. The significance of this observation remains unclear and controversial, but a growing body of evidence points to the presence of many functionally important non-coding transcripts [4]. Even if the bulk of the non-coding transcription is noise, this still provides a vast substrate upon which natural selection could have acted to generate a multitude of biologically important ncRNAs. Thus, even striking ncRNA-related discoveries, such as microRNAs, may only be the tip of the iceberg.
Interest in conducting computational screens for functional RNAs has risen with their increased prominence [5]. However, in contrast to protein coding genes, whose regular codon structure provides strong signals within nucleotide sequences, the signals for ncRNAs are subtler. For example, many ncRNAs are believed to be processed out of longer primary transcripts, including introns of protein-coding genes, and hence, lack features such as proximal promoters in addition to codon structure [1]. There is, however, one general characteristic shared by many (but not all) known ncRNAs: they fold into complex shapes that are crucial to function and are thus conserved. Although prediction of RNA 3D structure is less well developed than protein structure prediction, the prediction of RNA secondary structure (the set of intramolecular, largely Watson–Crick, base pairs that define the fundamental units from which the tertiary structure is formed) is reasonably well understood and computationally tractable. Furthermore, secondary structure also tends to be conserved, whereas primary sequences often evolve more rapidly (Figure 1), and are of limited use except when searching for close homologs. These facts make secondary structure the key signal to be exploited in ncRNA prediction. Although secondary structure (henceforth, simply structure) prediction is tractable, it is not trivial (e.g. involving interactions between nucleotides at variable and sometimes large distances in the primary sequence), which makes the problem challenging intellectually and computationally [5].
Most RNA molecules will fold into some secondary structure, but given the importance of secondary structure in functional ncRNAs, it is reasonable to ask whether they have more stable structures than random genomic sequences. A negative answer to this question was provided about a decade ago: the folding energy of RNA genes is not easily distinguished from that of appropriately chosen background sequences, such as randomly shuffled versions of the RNA gene itself [6,7]. Therefore, a simple screening approach that seeks unusually stable genomic segments will generally not work. MicroRNAs, whose precursors form unusually stable stem-loops, are a notable exception [8,9].
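The shuffled-background comparison described above can be sketched as a z-score computation. This is only an illustration under loud assumptions: the function names are hypothetical, a Nussinov-style count of nested base pairs is swapped in as a crude proxy for the thermodynamic folding energy, and the shuffle preserves only mononucleotide composition (more careful background models also preserve dinucleotide frequencies).

```python
import random
import statistics

# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_pairs(seq, min_loop=3):
    """Nussinov-style dynamic program: maximum number of nested base pairs.

    ASSUMPTION: a stand-in for a real free-energy calculation, not the
    scoring used in the studies cited in the text.
    """
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: position j is unpaired
            for k in range(i, j - min_loop):  # case: k pairs with j
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


def shuffle_zscore(seq, n_shuffles=100, seed=0):
    """Z-score of the native pair count against shuffled versions of seq.

    Positive values mean more pairs than the shuffled background; with a
    real folding energy the sign convention would be reversed.
    """
    rng = random.Random(seed)
    native = max_pairs(seq)
    background = []
    for _ in range(n_shuffles):
        letters = list(seq)
        rng.shuffle(letters)  # mononucleotide shuffle (assumption)
        background.append(max_pairs("".join(letters)))
    mean = statistics.mean(background)
    sd = statistics.pstdev(background)
    if sd == 0:
        return 0.0  # background has no spread; z-score is undefined
    return (native - mean) / sd
```

A call such as `shuffle_zscore("GGGGGAAAACCCCC")` scores a designed hairpin against its own shuffles. The point made in the text is that, for most real RNA genes, such scores do not separate cleanly from background.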
Although searching for possible structured elements in one nucleotide sequence is generally ineffective, as discussed above, searching in multiple (orthologous) sequences can be highly effective, since evolutionary conservation highlights functionally important regions of all kinds. Importantly, such searches leverage the rapidly increasing body of comparative genomic sequence data. A key issue, however, is that the evolutionary signature of an RNA gene is quite different from that of a protein-coding gene. In particular, as noted earlier, the nucleotide sequence of an ncRNA might evolve relatively rapidly, which makes identification and alignment of orthologous sequences difficult or impossible, especially when using tools that focus only on sequences. However, patterns of compensating base changes, for example, an A–U base pair in a human RNA sequence that corresponds to a C–G pair in mice, can provide evidence to support the existence of a conserved RNA structure (without requiring conserved sequence; Figure 1) and insight into the structure itself. Indeed, short of X-ray crystallography, this type of comparative analysis, done carefully by human experts [10], has been the gold standard for RNA secondary structure prediction for more than 40 years. In a nutshell, this highlights the key challenges in computational prediction of ncRNAs: to find orthologous regions, expose the common structure therein, and do so rapidly and accurately.
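The idea of compensating base changes can be illustrated with a small counting routine. This is a minimal sketch, not a method from the literature reviewed here: the function name, the toy sequences and the assumption that the paired alignment columns are already known are all hypothetical. For each paired column it counts the sequence pairs in which both positions differ and yet both sequences still form a canonical pair.

```python
from itertools import combinations

# Canonical pairs written as two-letter strings (Watson-Crick plus G-U wobble).
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def compensating_changes(alignment, pairs):
    """Count compensating double substitutions per paired column.

    alignment: list of equal-length aligned RNA sequences (no gap handling).
    pairs: list of (i, j) column indices assumed to be base-paired.
    Returns {(i, j): number of sequence pairs showing a compensating change}.
    """
    results = {}
    for i, j in pairs:
        observed = [seq[i] + seq[j] for seq in alignment]
        count = 0
        for a, b in combinations(observed, 2):
            both_pair = a in CANONICAL and b in CANONICAL
            both_changed = a[0] != b[0] and a[1] != b[1]
            if both_pair and both_changed:
                count += 1
        results[(i, j)] = count
    return results


# Toy example (hypothetical sequences): a 3-bp stem closing a 4-nt loop.
human = "GACGAAAGUC"  # columns 1 and 8 form an A-U pair
mouse = "GCCGAAAGGC"  # the same columns form a C-G pair
stem = [(0, 9), (1, 8), (2, 7)]
print(compensating_changes([human, mouse], stem))
# {(0, 9): 0, (1, 8): 1, (2, 7): 0}
```

Only the middle stem position shows a compensating change here, mirroring the A–U versus C–G example in the text. A real screen would also need to handle gaps, weight sequences by phylogeny and assess statistical significance.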
The comparative approach is important for an additional reason. Over the next few years, we expect that emerging technology such as high-throughput sequencing of RNA (RNAseq [11]) will reveal the transcriptomes of many organisms with unprecedented depth and precision. Yet, given the extensive breadth of genomic transcription now observed, at least in mammals, evidence of transcription can no longer be taken as proof of functional importance. Any observed transcript might just be noise or incompletely degraded detritus that arises from the expression of some nearby, functionally important RNA. Furthermore, experimental protocols will remain limited with regard to the diversity of species, cell types, states, growth and stress conditions that are probed. Consequently, lack of measured expression of a given genomic segment is not proof of lack of function. By contrast, evolutionary conservation strongly suggests functional importance, whether or not expression has already been verified experimentally. This does not deny the value of experimental evidence, of course, nor the existence of functionally important species-specific ncRNAs, but merely argues that detection of evolutionarily conserved ncRNAs by comparative genomics is a powerful tool, and belongs in any effort to understand living systems.
Computational search for conserved RNA structure does have certain important limitations. We highlight two of them here, because they color much of what follows. First, these searches are expensive computationally, principally because of the nature of the underlying RNA folding algorithms that need to be applied to these multiple sequences [12,13]. Even single-sequence folding algorithms have run times that grow as the cube of the sequence length. Applied naively, whatever time is needed to screen a 1 kb sequence, screening 10 kb will take 1000 times longer and screening 100 kb will take 1 million times longer. Hence, all successful programs in this arena are engineered carefully to control run time, which entails some, hopefully modest, loss of accuracy on long genomic sequences. For example, one simple, widely used strategy is the “sliding window” approach, wherein the genomic sequence is cut into multiple, overlapping, fixed-length segments (windows) that are processed separately. This confines the cubic run time penalty to the length of the window, but unfortunately, also limits the lengths of discoverable structures and risks arbitrarily truncating them. Even using substantially more sophisticated techniques, genome-scale ncRNA analyses often consume tens to hundreds of computer years. These high computational costs are one reason why ncRNA gene finding is still in its infancy.
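The cubic scaling and the effect of the sliding-window strategy can be made concrete with a small sketch. The function names, the window length of 200 and the step of 100 are illustrative assumptions, not values taken from any particular program, and cost is expressed in arbitrary units on the assumption that run time grows as the cube of segment length.

```python
def sliding_windows(seq_len, window=200, step=100):
    """Yield half-open (start, end) intervals of overlapping windows.

    A final window is placed flush with the end of the sequence so that
    no position is left uncovered.
    """
    if seq_len <= window:
        yield (0, seq_len)
        return
    start = 0
    while start + window < seq_len:
        yield (start, start + window)
        start += step
    yield (seq_len - window, seq_len)


def relative_cost(seq_len, window=None, step=100, exponent=3):
    """Cost in arbitrary units, assuming run time ~ length ** exponent.

    With window=None the whole sequence is folded at once; otherwise the
    cost is summed over the sliding windows.
    """
    if window is None:
        return seq_len ** exponent
    return sum((end - start) ** exponent
               for start, end in sliding_windows(seq_len, window, step))


# Naive folding: 10x the length costs 1000x, 100x the length costs 10**6 x.
print(relative_cost(10_000) // relative_cost(1_000))
print(relative_cost(100_000) // relative_cost(1_000))

# Windowed screening of 100 kb versus folding the full 100 kb at once.
print(relative_cost(100_000, window=200) < relative_cost(100_000))
```

With a fixed window, the total cost grows only linearly with genome length, at the price noted in the text: no structure longer than the window can be found, and structures that straddle a window boundary may be truncated.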
The second significant limitation of these general searches for conserved RNA secondary structures is more conceptual. It is natural to want to think of each element discovered as an ncRNA gene, but the truth is more complex because the approaches described here might generate only a partial picture for each ncRNA. For example, technical limitations related to window boundaries or splicing could result in partial or fragmentary predictions. More intrinsically, some ncRNAs lack conserved secondary structures, or may have only patches of conserved structure embedded in longer, largely unstructured transcripts. Additionally, conserved, functionally important RNA structures, such as the selenocysteine insertion sequence (SECIS element), are known to exist in mRNAs, usually in their untranslated regions. Therefore, identification of RNA structures in genomic data should trigger post-processing steps and follow-up experiments to characterize transcript boundaries (and function) more precisely. For reasons of simplicity, however, we will refer to individual conserved structures as ncRNA genes.
This review focuses explicitly on computational prediction of ncRNA elements by comparative genomics, that is, the discovery of conserved structured elements in multiple genomic sequences. Other methods for de novo prediction have succeeded in some contexts (e.g. exploiting organism-specific differences in mono- or dinucleotide frequencies of ncRNAs versus background [14–16]), but the comparative approach appears to be the most broadly applicable. We will say little about the related ncRNA homology search problem (finding new instances of a particular RNA family given one or more examples) but this equally important task comes with its own set of issues [17], especially the difficulty of finding homologs outside the phylogenetic range of known examples.
From RNA folding to gene finding
Even though RNA structure cannot be detected reliably by merely folding single sequences, the principles obtained from folding single sequences are fundamental and often constitute an implicit part of more elaborate methods. For example, to date, no large-scale RNA structure screens have accounted for so-called pseudoknots, because the underlying RNA folding algorithms do not do that. Without pseudoknots, RNA secondary structure can be represented by nested parentheses. For example, the hairpin
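The nested-parentheses representation can be read back into a list of base pairs with a simple stack. The sketch below assumes the common dot-bracket convention, in which "." marks an unpaired position. The function name and the example string are hypothetical and are not the example from the text.

```python
def parse_dot_bracket(structure):
    """Convert a dot-bracket string into a sorted list of (i, j) base pairs.

    "(" opens a pair, ")" closes the most recently opened one, and "."
    is an unpaired position. A single bracket type can only express
    nested pairs, so pseudoknots (crossing pairs) cannot be represented.
    """
    stack = []
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Hypothetical hairpin: a 3-bp stem closing a 4-nt loop.
print(parse_dot_bracket("(((....)))"))
# [(0, 9), (1, 8), (2, 7)]
```

Because every closing parenthesis matches the most recent unmatched opening one, the pairs are nested by construction, which is the property that makes pseudoknot-free folding algorithms tractable.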
In silico screening for RNA structures
Prediction of RNA structure in genomic sequences is related closely to the existing methods for RNA structure prediction. As indicated above, multiple sequences are needed to predict RNA structures reliably. Several strategies exist for RNA structure prediction based on multiple sequences [25], which can be categorized loosely as “align-first,” “fold-first” and “joint.” As the name suggests, align-first strategies start by aligning all sequences using standard multiple sequence alignment tools,
Conclusions and perspectives
To date, annotation in the genome databases relates almost exclusively to protein coding genes. By contrast, the computational screens for ncRNAs described here suggest that a large number of functional ncRNAs remain to be found. This is consistent with the observation that most of the non-coding mammalian genome is transcribed. Computational de novo discovery of ncRNAs within genomic sequences is still in its infancy. Nevertheless, these screens provide a valuable starting point for subsequent
Note added in proof
See Kavanaugh [84] for a recent success with the single-sequence folding energy approach, as well as some investigation of the subtleties inherent in sliding window approaches.
Acknowledgements
We thank the anonymous referees for numerous helpful suggestions. This work was supported in part by the Danish Research Council for Technology and Production, The Lundbeck Foundation and Danish Center for Scientific Computation, NIEHS Grant P30ES07033, and Austrian GEN-AU project “Regulatory ncRNAs.”
References (84)
Computational genomics of noncoding RNA genes. Cell (2002)
Basic local alignment search tool. J. Mol. Biol. (1990)
Computational identification of noncoding RNAs in E. coli by comparative genomics. Curr. Biol. (2001)
et al. Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics. J. Mol. Biol. (2004)
et al. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J. Mol. Biol. (2002)
et al. Developmental regulation of EVF-1, a novel non-coding RNA transcribed upstream of the mouse Dlx6 gene. Gene Expr. Patterns (2004)
Non-coding RNA genes and the modern RNA world. Nat. Rev. Genet. (2001)
International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature...
ENCODE Project Consortium (2004) The ENCODE (ENCyclopedia Of DNA Elements) Project. Science, 306,...
The genetic signatures of noncoding RNAs. PLoS Genetics (2009)
Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs. Bioinformatics
No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res.
Principles and limitations of computational microRNA gene and target finding. DNA Cell Biol.
Probing RNA structure, function, and history by comparative analysis. The RNA World
Transcriptome content and dynamics at single-nucleotide resolution. Genome Biol.
Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc. Natl. Acad. Sci. U. S. A.
How do RNA folding algorithms work? Nat. Biotechnol.
Searching for RNA genes using base-composition statistics. Nucleic Acids Res.
De novo search for non-coding RNA genes in the AT-rich genome of Dictyostelium discoideum: performance of Markov-dependent genome feature scoring. Genome Res.
The tedious task of finding homologous non-coding RNA genes. RNA
Algorithms and thermodynamics for RNA secondary structure prediction: A practical guide
Fast folding and comparison of RNA secondary structures (The Vienna RNA Package). Monatshefte für Chemie (Chemical Monthly)
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics
RNA sequence analysis using covariance models. Nucleic Acids Res.
Stochastic context-free grammars for tRNA modeling. Nucleic Acids Res.
Efficient parameter estimation for RNA secondary structure prediction. Bioinformatics
A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics
Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math.
Shape based indexing for faster search of RNA family databases. BMC Bioinformatics
Locomotif: from graphical motif description to RNA motif search. Bioinformatics
Structural implications of novel diversity in eucaryal RNase P RNA. RNA
Infernal 1.0: inference of RNA alignments. Bioinformatics
Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput. Biol.
Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics
Computational identification of non-coding RNAs in Saccharomyces cerevisiae by comparative genomics. Nucleic Acids Res.
Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency. RNA
Fast and reliable prediction of noncoding RNAs. Proc. Natl. Acad. Sci. U. S. A.
Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nat. Biotechnol.