On the structural repertoire of pools of short, random RNA sequences
Introduction
The distribution of RNA structural motifs within pools of random sequences is extremely heterogeneous, as theoretical studies and the observation of natural secondary structures demonstrate (Fontana et al., 1993, Schuster et al., 1994). Knowledge of the relationship between sequence space and structure space has both theoretical and practical relevance, among other reasons because structural diversity conditions the spectrum of functionalities present in, and thus selectable from, a random pool of sequences (Lorsch and Szostak, 1994). The role played by parameters such as the sequence length (Sabeti et al., 1997) or the nucleotide composition (Knight et al., 2005, Kim et al., 2007) has been addressed as a way of modifying the functional diversity of random molecular ensembles. Two frequent goals of those studies are to maximize the structural diversity present in the pool and to enhance the presence of certain structures able to perform new functions (Wilson and Szostak, 1999, Gan et al., 2003).
Every RNA sequence can be mapped onto a secondary structure that corresponds to its minimum free energy folded state. The first mathematical studies on this correspondence readily revealed the huge degeneracy between the set of all sequences (genotype space, of size $4^n$ if $n$ denotes the length of the sequences) and the set of their possible secondary structures, a first approximation to the phenotype space (Stein and Waterman, 1978, Waterman, 1978). Calculations based on the compatibility between sequences and structures yield estimates of the average number of sequences that fold into a secondary structure. If isolated pairs are allowed in the secondary structure, the average number of sequences of length $n$ folding into each possible secondary structure grows exponentially with $n$ (Stein and Waterman, 1978). However, this huge number is of little practical relevance in the light of empirical observations and computational results with random pools: the so-called common structures are typically many orders of magnitude more frequent than rare structures (Schuster et al., 1994, Grüner et al., 1996a, Joyce, 2004). While common structures are easily obtained, even in small populations, and do not depend strongly on the nucleotide composition, sequences folding into rare structures often need to be designed, for instance by means of inverse folding algorithms (Schuster et al., 1994, Hofacker et al., 1994).
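The degeneracy described above can be illustrated with a short counting sketch. It uses a recursion of the kind introduced in the combinatorial studies cited in this paragraph: the last base is either unpaired or paired with an earlier base, which splits the chain into two independent parts. The function names and the `min_loop` parameter (minimum number of unpaired bases enclosed by a pair) are assumptions made for this illustration, and the count includes every admissible structure, whether or not any real sequence folds into it.

```python
def count_structures(n, min_loop=1):
    """Count secondary structures on n bases, ignoring the sequence.

    Isolated base pairs are allowed. min_loop is the minimum number of
    bases enclosed by any base pair (an illustrative parameter).
    """
    s = [1] * (n + 1)  # s[k]: number of structures on k bases
    for length in range(1, n + 1):
        total = s[length - 1]  # case 1: the last base is unpaired
        # case 2: the last base pairs with base j (1-indexed), leaving
        # j - 1 bases to its left and length - j - 1 bases enclosed
        for j in range(1, length - min_loop):
            total += s[j - 1] * s[length - j - 1]
        s[length] = total
    return s[n]


def mean_sequences_per_structure(n, min_loop=1):
    """Ratio of sequence space size (4^n) to the number of structures."""
    return 4 ** n / count_structures(n, min_loop)


if __name__ == "__main__":
    for n in (10, 20, 35):
        print(n, count_structures(n), mean_sequences_per_structure(n))
```

The ratio is only an average over all conceivable structures; as the text stresses, the actual distribution of sequences over structures is far from uniform.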
A main concern of experimentalists seeking new ribozyme or aptamer activities is how to shift the structural composition of the initial pools in in vitro experiments away from average expectations, for instance by enhancing the presence of rare structures or by biasing the ensemble towards specific common structures. One approach has been to maximize the length of the sequences in the starting pool in an attempt to increase the number of different motifs available (Bartel and Szostak, 1993). However, quantitative analyses have shown that long sequences offer little advantage for isolating simple motifs, and that their effect might even be inhibitory (Sabeti et al., 1997). More recently, attention has focused on how the probability of obtaining a fixed structural motif depends on the nucleotide composition (Knight et al., 2005, Kim et al., 2007). Interestingly, although increasing the size of the initial pool should increase the number of different structures present, the dependence of structural diversity on population size has rarely been addressed. Furthermore, computational results indicate that the number of different major topological motifs present in the pool depends very weakly on its size (Gevertz et al., 2005). Modular evolution has been suggested as a plausible way to generate complex structures constructively. This approach can be implemented either through the isolation of simple modules from random populations of short sequences, followed by their directed modification and eventual combination (Sabeti et al., 1997), or as the selective evolution of populations towards specific modules, together with their ligation in suitable environments (Manrubia and Briones, 2007). The latter approach is of particular relevance at prebiotic stages, when biochemical function had to emerge in an unsupervised way.
Although our knowledge of the genotype–phenotype map has expanded considerably in the last three decades, our understanding still has to improve if we are to grasp its multiple implications for evolution, for setting the conditions for further selection, and for the dynamic behavior of highly heterogeneous molecular populations. This is the main motivation for the study presented here, where we fold random sequences of length 35 nt and classify the obtained structures into main structure families. We find correlations between the frequency of certain structure families and the nucleotide composition, and conclude that rare structures are to be found far from the average composition. Hence, they could be enhanced by tuning the fraction of each nucleotide in the sequences. One of our main results concerns the high fraction of sequences folding into topologically simple structure families: the most abundant motifs resulting from random polymerization could constitute simple building blocks able to combine into more complex structures.
Distribution and classification of secondary structures with isolated base pairs
In this section, we describe the results of the folding of RNA molecules of length 35 nt consisting of random linear sequences composed of the four types of nucleotides A, C, G, and U. In this first part of the study we allow the presence of isolated base pairs in the secondary structure.
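As a rough illustration of how such a pool could be assembled, the sketch below draws random sequences of length 35 nt over the four nucleotides and computes the composition of each one. The function names, the seed handling and the uniform choice of nucleotides are assumptions of this example; Python's `random` module is used for convenience (it is based on the Mersenne Twister), and the folding step itself is not shown because it requires external software.

```python
import random
from collections import Counter

ALPHABET = "ACGU"


def random_pool(size, length=35, seed=0):
    """Return `size` random RNA sequences with equiprobable nucleotides."""
    rng = random.Random(seed)  # seeded generator, for reproducibility
    return [
        "".join(rng.choice(ALPHABET) for _ in range(length))
        for _ in range(size)
    ]


def composition(seq):
    """Fraction of each nucleotide in a sequence."""
    counts = Counter(seq)
    return {nt: counts[nt] / len(seq) for nt in ALPHABET}


if __name__ == "__main__":
    pool = random_pool(5)
    for seq in pool:
        print(seq, composition(seq))
```

Recording the composition alongside each sequence is what would later allow structure families to be related to nucleotide content.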
Discussion
The structural space of RNA is vastly smaller than its sequence space. In order to delve into the features of such degeneracy, we have folded in silico two pools of random RNA sequences of length 35 nt and classified the obtained secondary structures into roughly 20 structure families (see Methods and Table 1). We found that HPs are the most probable structures formed by a random RNA sequence of this length. In fact, more than half of the sequences in our pools fold in structures
Programming and computational resources
Simulations have been carried out on the Itanium II cluster of INTA (Instituto Nacional de Técnica Aeroespacial, Spain). For random number generation, we relied on the Mersenne Twister and Ziff's FSR4 algorithms as provided by the GNU Scientific Library (GSL), Version 1.7 (see http://www.gnu.org/software/gsl). Although we fold molecules, this represents only a very small fraction of the sequence space, formed by $4^{35} \approx 1.2 \times 10^{21}$ sequences. The probability to have a sequence repeated is of order ,
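The order of magnitude of such a repetition probability can be estimated with a birthday-type approximation. The sketch below is only an illustration: the pool size is a free parameter here, since its value is not given in this text, and both function names are hypothetical. It assumes sequences are drawn independently and uniformly from the $4^{35}$ possibilities.

```python
import math


def space_size(length=35, alphabet=4):
    """Number of distinct sequences of the given length."""
    return alphabet ** length


def expected_repeated_pairs(pool_size, length=35):
    """Expected number of pairs of identical sequences in the pool."""
    pairs = pool_size * (pool_size - 1) / 2
    return pairs / space_size(length)


def prob_any_repeat(pool_size, length=35):
    """Birthday approximation to P(at least two sequences coincide)."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x
    return -math.expm1(-expected_repeated_pairs(pool_size, length))


if __name__ == "__main__":
    for pool_size in (10 ** 6, 10 ** 8):  # illustrative sizes only
        print(pool_size, prob_any_repeat(pool_size))
```

For pools much smaller than the square root of the sequence space size, the two quantities nearly coincide and repeats can safely be neglected.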
Acknowledgements
The authors wish to acknowledge the technical assistance of Ruth Lobo and Pilar Viñado with the computations carried out at the Itanium II cluster of INTA, and the support of Isidro Cano and Hewlett-Packard within the project OriGenes.
Author contributions. All authors conceived and designed the study. MS performed the numeric calculations, SCM the analytic calculations. All authors analysed the data and wrote the paper.
Funding. This work was supported by Ministerio de Educación y Ciencia
References
- et al. Minimum sequence requirements for selective RNA-ligand binding: a molecular mechanics algorithm using molecular dynamics and free-energy techniques. J. Comp. Chem. (2006)
- et al. RNAs everywhere: genome-wide annotation of structured RNAs. J. Exp. Zool. (Mol. Dev. Evol.) (2007)
- et al. Isolation of new ribozymes from a large pool of random sequences. Science (1993)
- et al. Informational complexity and functional activity of RNA structures. J. Am. Chem. Soc. (2004)
- et al. Statistics of RNA secondary structures. Biopolymers (1993)
- et al. Exploring the repertoire of RNA secondary motifs using graph theory; implications for RNA design. Nucleic Acids Res. (2003)
- et al. In vitro RNA random pools are not structurally diverse: a computational analysis. RNA (2005)
- et al. Analysis of RNA sequence structure maps by exhaustive enumeration. I. Neutral networks. Monatsh. Chem. (1996)
- et al. Analysis of RNA sequence structure maps by exhaustive enumeration. II. Structures of neutral networks and shape space covering. Monatsh. Chem. (1996)
- et al. A story: unpaired adenosine bases in ribosomal RNAs. J. Mol. Biol. (2000)
- The effect of RNA secondary structures on RNA-ligand binding and the modifier RNA mechanism: a quantitative model. Gene
- RNA structural motifs: building blocks of a modular biomolecule. Quart. Rev. Biophys.
- Using RNA secondary structures to guide sequence motif finding towards single-stranded regions. Nucleic Acids Res.
- Fast folding and comparison of RNA secondary structures. Monatsh. Chem.
- Combinatorics of RNA secondary structures. Discrete Appl. Math.
- Improved predictions of secondary structures for RNA. Proc. Natl Acad. Sci. USA
- Directed evolution of nucleic acid enzymes. Annu. Rev. Biochem.
- Poly(A)-binding protein binds to A-rich sequences via RNA-binding domains and. RNA Biol.
- A computational proposal for designing structured RNA pools for in vitro selection of RNAs. RNA
- Finding specific RNA motifs: function in a zeptomole world? RNA
- Abundance of correctly folded RNA motifs in sequence space calculated on computational grids. Nucleic Acids Res.