On the structural repertoire of pools of short, random RNA sequences

doi:10.1016/j.jtbi.2008.02.018

Journal of Theoretical Biology

Volume 252, Issue 4, 21 June 2008, Pages 750-763

https://doi.org/10.1016/j.jtbi.2008.02.018

Abstract

A detailed knowledge of the mapping between sequence and structure spaces in populations of RNA molecules is essential to better understand their present-day functional properties, to envisage a plausible early evolution of RNA in a prebiotic chemical environment, and to improve the design of in vitro evolution experiments, among other goals. Analysis of natural RNAs, as well as in vitro and computational studies, shows that certain RNA structural motifs are much more abundant than others, pointing to a complex relation between sequence and structure. Within this framework, we have computationally investigated the structural properties of a large pool ($10^{8}$ molecules) of single-stranded, 35 nt-long, random RNA sequences. The secondary structures obtained are ranked and classified into structure families. The number of structures in the main families is calculated analytically and compared with the numerical results. This permits a quantification of the fraction of structure space covered by a large pool of sequences. We further show that the number of structural motifs and their frequencies are highly unbalanced with respect to the nucleotide composition: simple structures such as stem-loops and hairpins arise from sequences depleted in G, while more complex structures require an enrichment of G. In general, we observe a strong correlation between subfamilies (characterized by a fixed number of paired nucleotides) and nucleotide composition. Our results are compared to the structural repertoire obtained in a second pool where isolated base pairs are prohibited.

Introduction

The distribution of RNA structural motifs within pools of random sequences is extremely heterogeneous, as theoretical studies and observation of natural secondary structures demonstrate (Fontana et al., 1993, Schuster et al., 1994). Knowledge of the relationship between sequence and structure space has both theoretical and practical relevance, among other reasons because structural diversity conditions the spectrum of different functionalities present in, and thus selectable from, a random pool of sequences (Lorsch and Szostak, 1994). The role played by parameters such as the sequence length (Sabeti et al., 1997) or the nucleotide composition (Knight et al., 2005, Kim et al., 2007) has been addressed as a way of modifying the functional diversity of random molecular ensembles. Two frequent goals of those studies are to maximize the structural diversity present in the pool and to enhance the presence of certain structures able to perform new functions (Wilson and Szostak, 1999, Gan et al., 2003).

Every RNA sequence can be mapped onto a secondary structure that corresponds to its minimum free energy folded state. The first mathematical studies on this correspondence readily revealed the huge degeneracy existing between the set of all sequences (genotype space, of size $4^{n}$ if $n$ denotes the length of the sequences) and the set of their possible secondary structures, a first approximation to the phenotype space (Stein and Waterman, 1978, Waterman, 1978). Calculations based on the compatibility between sequences and structures yield estimates of the average number of sequences that fold into a secondary structure. If isolated pairs are allowed in the secondary structure, there are, on average, about $1.402\, n^{3/2} \times 1.748^{n}$ sequences of length $n$ folding into each possible secondary structure (Stein and Waterman, 1978). However, this huge number is of little practical relevance in the light of empirical observations and computational results with random pools: the so-called common structures are typically many orders of magnitude more frequent than rare structures (Schuster et al., 1994, Grüner et al., 1996a, Joyce, 2004). While common structures are easily obtained, even in small populations, and do not depend strongly on the nucleotide composition, sequences folding into rare structures often need to be designed, for instance by means of inverse folding algorithms (Schuster et al., 1994, Hofacker et al., 1994).
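As a rough worked example of the estimate quoted above, the following sketch evaluates the asymptotic expression at the sequence length used in this study, $n = 35$. The function names are illustrative and do not come from the original work. The implied number of structures is obtained here simply by dividing the size of sequence space, $4^{n}$, by that average, which assumes that every sequence maps to exactly one structure.

```python
def avg_sequences_per_structure(n: int) -> float:
    """Asymptotic estimate quoted in the text: 1.402 * n^(3/2) * 1.748^n."""
    return 1.402 * n ** 1.5 * 1.748 ** n


def implied_structure_count(n: int) -> float:
    """Size of sequence space, 4^n, divided by the average above.

    Illustrative only: assumes each sequence folds into exactly one structure.
    """
    return 4 ** n / avg_sequences_per_structure(n)


if __name__ == "__main__":
    n = 35  # sequence length used in the study
    print(f"average sequences per structure: {avg_sequences_per_structure(n):.2e}")
    print(f"implied number of structures:    {implied_structure_count(n):.2e}")
```

For $n = 35$ this gives an average of roughly $9 \times 10^{10}$ sequences per structure, hence on the order of $10^{10}$ possible structures. As the paragraph above stresses, the average says little about any individual structure, since common and rare structures differ in frequency by many orders of magnitude.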

A main concern of experimentalists seeking new ribozyme or aptamer activities is how to shift the structural composition of the initial pools used in in vitro experiments away from average expectations, for instance by enhancing the presence of rare structures or by biasing the ensemble towards specific common structures. One approach has been to maximize the length of the sequences in the starting pool in an attempt to increase the number of different motifs available (Bartel and Szostak, 1993). However, quantitative analyses have shown that long sequences offer little advantage for isolating simple motifs, and their effect might even be inhibitory (Sabeti et al., 1997). More recently, attention has focused on how the probability of obtaining a fixed structural motif depends on the nucleotide composition (Knight et al., 2005, Kim et al., 2007). Interestingly, though an increase in the size of the initial pool should imply an increase in the number of different structures present, the dependence of structural diversity on population size has rarely been addressed. Furthermore, computational results indicate that the number of different major topological motifs present in the pool depends very weakly on its size (Gevertz et al., 2005). Modular evolution has been suggested as a plausible, constructive way to generate complex structures. This approach can be implemented either through the isolation of simple modules from random populations of short sequences, their directed modification and eventual combination (Sabeti et al., 1997), or as the selective evolution of populations towards specific modules, together with their ligation in suitable environments (Manrubia and Briones, 2007). The latter approach is of particular relevance at prebiotic stages, when biochemical function had to emerge in an unsupervised way.

Though our knowledge of the genotype–phenotype map has expanded considerably in the last three decades, our understanding still has to be improved in order to comprehend the multiple implications it has for evolution, for setting the conditions for further selection, and for the dynamic behavior of highly heterogeneous molecular populations. This is the main motivation for the study presented here, where we fold $10^{8}$ random sequences of length $n = 35$ nt and classify the obtained structures into main structure families. We find correlations between the frequency of certain structure families and the nucleotide composition, and conclude that rare structures are to be found far from the average composition. Hence, their presence could be enhanced by tuning the fraction of each nucleotide in the sequences. One of our main results concerns the high fraction of sequences folding into topologically simple structure families: the most abundant motifs resulting from random polymerization could constitute simple building blocks able to combine into more complex structures.

Section snippets

Distribution and classification of secondary structures with isolated base pairs

In this section, we describe the results of the folding of $10^{8}$ RNA molecules of length 35 nt consisting of random linear sequences composed of the four types of nucleotides A, C, G, and U. In this first part of the study we allow the presence of isolated base pairs in the secondary structure.
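The construction of such a pool can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that each of the four nucleotides is drawn independently with equal probability, and the names `random_pool` and `composition` are hypothetical. Python's standard `random` module is used because its generator is a Mersenne Twister. The folding step itself is omitted, since the software used is not named in this excerpt.

```python
import random
from collections import Counter

ALPHABET = "ACGU"


def random_pool(size: int, length: int = 35, seed: int = 0):
    """Yield `size` random RNA sequences of the given length.

    Assumption: nucleotides are independent and equiprobable.
    """
    rng = random.Random(seed)  # CPython's generator is a Mersenne Twister
    for _ in range(size):
        yield "".join(rng.choices(ALPHABET, k=length))


def composition(seq: str) -> dict:
    """Return the fraction of each nucleotide in a sequence."""
    counts = Counter(seq)
    return {nt: counts[nt] / len(seq) for nt in ALPHABET}


if __name__ == "__main__":
    # A small pool for illustration; the study folds 10**8 sequences.
    for seq in random_pool(3):
        print(seq, composition(seq))
```

Recording the composition of each sequence alongside its folded structure is one simple way to look for the kind of correlation between structure family and nucleotide content reported in the abstract.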

Discussion

The structural space of RNA is vastly smaller than its sequence space. In order to delve into the features of such degeneracy, we have folded in silico two pools of $10^{8}$ random RNA sequences of length 35 nt and classified the obtained secondary structures (about $10^{6}$) into roughly 20 structure families (see Methods and Table 1). We found that hairpins (HPs) are the most probable structures formed by a random RNA sequence of this length. In fact, more than half of the sequences in our pools fold into structures

Programming and computational resources

Simulations were carried out on the Itanium II cluster of INTA (Instituto Nacional de Técnica Aeroespacial, Spain). For random number generation, we relied on the Mersenne Twister and Ziff's FSR4 algorithms as provided by the GNU Scientific Library (GSL), Version 1.7 (see http://www.gnu.org/software/gsl). Although we fold $10^{8}$ molecules, this represents only a very small fraction of the sequence space, formed by $4^{35} \simeq 10^{21}$ sequences. The probability of a sequence being repeated is of order $10^{-13}$,
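One way to arrive at a figure of this order is sketched below. The derivation is not shown in this excerpt, so the calculation is an assumption: sequences are taken to be independent and uniformly distributed over the $4^{35}$ possibilities, and the probability that a given sequence coincides with any other member of the pool is approximated to first order. The function names are illustrative.

```python
def per_sequence_repeat_probability(pool_size: int, length: int) -> float:
    """First-order chance that one given sequence matches any other pool member.

    Assumes independent, uniformly random sequences over 4**length possibilities.
    """
    return (pool_size - 1) / 4 ** length


def expected_repeated_pairs(pool_size: int, length: int) -> float:
    """Expected number of identical pairs in the whole pool (birthday estimate)."""
    return pool_size * (pool_size - 1) / 2 / 4 ** length


if __name__ == "__main__":
    n_seq, n_nt = 10 ** 8, 35
    print(f"per-sequence repeat probability: {per_sequence_repeat_probability(n_seq, n_nt):.1e}")
    print(f"expected repeated pairs in pool: {expected_repeated_pairs(n_seq, n_nt):.1e}")
```

Under these assumptions the per-sequence probability comes out slightly below $10^{-13}$, and the expected number of repeated pairs in the whole pool is far below one, so duplicates can be neglected.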

Acknowledgements

The authors wish to acknowledge the technical assistance of Ruth Lobo and Pilar Viñado with the computations carried out at the Itanium II cluster of INTA, and the support of Isidro Cano and Hewlett-Packard within the project OriGenes.

Author contributions. All authors conceived and designed the study. MS performed the numeric calculations, SCM the analytic calculations. All authors analysed the data and wrote the paper.

Funding. This work was supported by Ministerio de Educación y Ciencia

References (43)

P.C. Anderson et al. Minimum sequence requirements for selective RNA-ligand binding: a molecular mechanics algorithm using molecular dynamics and free-energy techniques. J. Comp. Chem. (2006)
R. Backofen et al. RNAs everywhere: genome-wide annotation of structured RNAs. J. Exp. Zool. (Mol. Dev. Evol.) (2007)
D.P. Bartel et al. Isolation of new ribozymes from a large pool of random sequences. Science (1993)
J.M. Carothers et al. Informational complexity and functional activity of RNA structures. J. Am. Chem. Soc. (2004)
W. Fontana et al. Statistics of RNA secondary structures. Biopolymers (1993)
H.H. Gan et al. Exploring the repertoire of RNA secondary motifs using graph theory; implications for RNA design. Nucleic Acids Res. (2003)
J. Gevertz et al. In vitro RNA random pools are not structurally diverse: a computational analysis. RNA (2005)
W. Grüner et al. Analysis of RNA sequence structure maps by exhaustive enumeration. I. Neutral networks. Monatsh. Chem. (1996a)
W. Grüner et al. Analysis of RNA sequence structure maps by exhaustive enumeration. II. Structures of neutral networks and shape space covering. Monatsh. Chem. (1996b)
R.R. Gutell et al. A story: unpaired adenosine bases in ribosomal RNAs. J. Mol. Biol. (2000)
J. Hackermüller et al. The effect of RNA secondary structures on RNA-ligand binding and the modifier RNA mechanism: a quantitative model. Gene (2005)
D.K. Hendrix et al. RNA structural motifs: building blocks of a modular biomolecule. Quart. Rev. Biophys. (2005)
M. Hiller et al. Using RNA secondary structures to guide sequence motif finding towards single-stranded regions. Nucleic Acids Res. (2006)
I.L. Hofacker et al. Fast folding and comparison of RNA secondary structures. Monatsh. Chem. (1994)
I.L. Hofacker et al. Combinatorics of RNA secondary structures. Discrete Appl. Math. (1998)
J.A. Jaeger et al. Improved predictions of secondary structures for RNA. Proc. Natl Acad. Sci. USA (1989)
G.F. Joyce. Directed evolution of nucleic acid enzymes. Annu. Rev. Biochem. (2004)
T. Khanam et al. Poly(A)-binding protein binds to A-rich sequences via RNA-binding domains 1+2 and 3+4. RNA Biol. (2006)
N. Kim et al. A computational proposal for designing structured RNA pools for in vitro selection of RNAs. RNA (2007)
R. Knight et al. Finding specific RNA motifs: function in a zeptomole world? RNA (2003)
R. Knight et al. Abundance of correctly folded RNA motifs in sequence space calculated on computational grids. Nucleic Acids Res. (2005)

