On the structural repertoire of pools of short, random RNA sequences

https://doi.org/10.1016/j.jtbi.2008.02.018

Abstract

A detailed knowledge of the mapping between sequence and structure spaces in populations of RNA molecules is essential to better understand their present-day functional properties, to envisage a plausible early evolution of RNA in a prebiotic chemical environment, and to improve the design of in vitro evolution experiments, among other goals. Analyses of natural RNAs, as well as in vitro and computational studies, show that certain RNA structural motifs are much more abundant than others, pointing to a complex relation between sequence and structure. Within this framework, we have investigated computationally the structural properties of a large pool ($10^{8}$ molecules) of single-stranded, 35 nt-long, random RNA sequences. The secondary structures obtained are ranked and classified into structure families. The number of structures in the main families is analytically calculated and compared with the numerical results. This permits a quantification of the fraction of structure space covered by a large pool of sequences. We further show that the number of structural motifs and their frequencies are highly unbalanced with respect to the nucleotide composition: simple structures such as stem-loops and hairpins arise from sequences depleted in G, while more complex structures require an enrichment of G. In general, we observe a strong correlation between subfamilies (characterized by a fixed number of paired nucleotides) and nucleotide composition. Our results are compared to the structural repertoire obtained in a second pool where isolated base pairs are prohibited.

Introduction

The distribution of RNA structural motifs within pools of random sequences is extremely heterogeneous, as theoretical studies and observation of natural secondary structures demonstrate (Fontana et al., 1993, Schuster et al., 1994). Knowledge of the relationship between sequence and structure space has a theoretical and practical relevance, among other reasons because structural diversity conditions the spectrum of different functionalities present in—and thus selectable from—a random pool of sequences (Lorsch and Szostak, 1994). The role played by parameters such as the sequence length (Sabeti et al., 1997) or the nucleotide composition (Knight et al., 2005, Kim et al., 2007) has been addressed as a way of modifying the functional diversity of random molecular ensembles. Two frequent goals of those studies are to maximize the structural diversity present in the pool and to enhance the presence of certain structures able to perform new functions (Wilson and Szostak, 1999, Gan et al., 2003).

Every RNA sequence can be mapped onto a secondary structure that corresponds to its minimum free energy folded state. The first mathematical studies on this correspondence readily revealed the huge degeneracy existing between the set of all sequences (genotype space, of magnitude $4^{n}$ if $n$ denotes the length of the sequences) and the set of their possible secondary structures (a first approximation to the phenotype space) (Stein and Waterman, 1978, Waterman, 1978). Calculations based on the compatibility between sequences and structures yield estimates of the average number of sequences that fold into a secondary structure. If isolated pairs are allowed in the secondary structure, there are, on average, about $1.402\,n^{3/2}\,1.748^{n}$ sequences of length $n$ folding into each possible secondary structure (Stein and Waterman, 1978). However, this huge number is of little practical relevance in the light of empirical observations and computational results with random pools: the so-called common structures are typically many orders of magnitude more frequent than rare structures (Schuster et al., 1994, Grüner et al., 1996a, Joyce, 2004). While common structures are easily obtained, even in small populations, and do not depend strongly on the nucleotide composition, sequences folding into rare structures often need to be designed, for instance by means of inverse folding algorithms (Schuster et al., 1994, Hofacker et al., 1994).
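As a rough numerical illustration of the estimate quoted above, the following Python sketch evaluates the size of sequence space and the asymptotic average number of sequences per structure for the sequence length studied in this work, n = 35. The function names are hypothetical, the constants are simply those quoted in the text, and an asymptotic formula is only approximate at such a short length.

```python
def sequence_space_size(n: int) -> int:
    """Number of distinct RNA sequences of length n over {A, C, G, U}."""
    return 4 ** n


def avg_sequences_per_structure(n: int) -> float:
    """Asymptotic estimate 1.402 * n^(3/2) * 1.748^n quoted in the text
    (case in which isolated base pairs are allowed)."""
    return 1.402 * n ** 1.5 * 1.748 ** n


if __name__ == "__main__":
    n = 35
    total = sequence_space_size(n)
    per_structure = avg_sequences_per_structure(n)
    print(f"sequences of length {n}: {total:.3e}")
    print(f"average sequences per structure: {per_structure:.3e}")
    # Ratio of the two numbers: the structure count implied by the estimate.
    print(f"implied number of structures: {total / per_structure:.3e}")
```

For n = 35 the sequence space contains about $10^{21}$ sequences and the estimate gives roughly $10^{11}$ sequences per structure, which is the average that the observed dominance of common structures departs from.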

A main concern of experimentalists seeking new ribozyme or aptamer activities is how to shift the structural composition of the initial pools in in vitro experiments away from average expectations, for instance by enhancing the presence of rare structures or by biasing the ensemble towards specific common structures. One approach has been to maximize the length of the sequences in the starting pool in an attempt to increase the number of different motifs available (Bartel and Szostak, 1993). However, quantitative analyses have shown that long sequences offer little advantage for isolating simple motifs, and their effect might even be inhibitory (Sabeti et al., 1997). More recently, attention has focused on how the probability of obtaining a fixed structural motif depends on the nucleotide composition (Knight et al., 2005, Kim et al., 2007). Interestingly, although an increase in the size of the initial pool should imply an increase in the number of different structures present, the dependence of structural diversity on population size has rarely been addressed. Furthermore, computational results indicate that the number of different major topological motifs present in the pool depends very weakly on its size (Gevertz et al., 2005). Modular evolution has been suggested as a plausible way to generate complex structures in a constructive way. This approach can be implemented either through the isolation of simple modules from random populations of short sequences, their directed modification and eventual combination (Sabeti et al., 1997), or as the selective evolution of populations towards specific modules, together with their ligation in suitable environments (Manrubia and Briones, 2007). The latter approach is of particular relevance at prebiotic stages, when biochemical function had to emerge in an unsupervised way.

Though our knowledge of the genotype–phenotype map has expanded greatly in the last three decades, our understanding still has to be improved in order to comprehend the multiple implications it has on evolution, on setting the conditions for further selection, and on the dynamic behavior of highly heterogeneous molecular populations. This is the main motivation for the study presented here, where we fold $10^{8}$ random sequences of length $n = 35$ nt and classify the obtained structures into main structure families. We find correlations between the frequency of certain structure families and the nucleotide composition, and conclude that rare structures are to be found far from the average composition. Hence, they could be enhanced by tuning the fraction of each nucleotide in the sequences. One of our main results concerns the high fraction of sequences folding into topologically simple structure families: the most abundant motifs resulting from random polymerization could constitute simple building blocks able to combine into more complex structures.
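The kind of correlation described here, between groups of structures and nucleotide composition, can be sketched in Python as follows. This is only an illustration under stated assumptions: structures are assumed to be given in dot-bracket notation (matched parentheses for pairs, dots for unpaired positions), the grouping key is the number of paired nucleotides, and the function names are hypothetical. It is not the classification procedure used in this work.

```python
from collections import defaultdict


def paired_count(dot_bracket: str) -> int:
    """Number of paired nucleotides in a dot-bracket string."""
    opens = dot_bracket.count("(")
    if opens != dot_bracket.count(")"):
        raise ValueError("unbalanced structure")
    return 2 * opens  # every base pair involves two nucleotides


def mean_g_by_paired_count(folded):
    """Map each number of paired nucleotides to the mean G fraction.

    `folded` is an iterable of (sequence, dot_bracket) pairs, assumed to
    come from some folding program; none is called here.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of G fractions, count]
    for sequence, structure in folded:
        key = paired_count(structure)
        totals[key][0] += sequence.count("G") / len(sequence)
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

Applied to a folded pool, a table of this kind would show whether groups with more paired nucleotides tend to come from sequences richer in G.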

Distribution and classification of secondary structures with isolated base pairs

In this section, we describe the results of the folding of $10^{8}$ RNA molecules of length 35 nt consisting of random linear sequences composed of the four types of nucleotides A, C, G, and U. In this first part of the study we allow the presence of isolated base pairs in the secondary structure.
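A pool of this kind can be sketched in a few lines of Python. The sketch assumes that each position is drawn independently and uniformly from the four nucleotides, which is one reading of "random" and not a detail confirmed by the text above. The function names, the seed, and the small pool size are illustrative choices, and no folding step is included.

```python
import random
from collections import Counter

ALPHABET = "ACGU"


def random_pool(size: int, length: int = 35, seed: int = 0) -> list[str]:
    """Draw `size` sequences, each nucleotide chosen uniformly at random."""
    rng = random.Random(seed)  # CPython's generator is a Mersenne Twister
    return ["".join(rng.choices(ALPHABET, k=length)) for _ in range(size)]


def composition(sequence: str) -> dict[str, float]:
    """Fraction of each nucleotide in a sequence."""
    counts = Counter(sequence)
    return {nt: counts[nt] / len(sequence) for nt in ALPHABET}


if __name__ == "__main__":
    pool = random_pool(1000)  # far smaller than the pool described in the text
    print(pool[0], composition(pool[0]))
```

Each sequence would then be passed to a secondary structure prediction program. Its per-sequence composition is the quantity against which structure frequencies can later be compared.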

Discussion

The structural space of RNA is vastly smaller than its sequence space. In order to delve into the features of such degeneracy, we have folded in silico two pools of $10^{8}$ random RNA sequences of length 35 nt and classified the obtained secondary structures (about $10^{6}$) in roughly 20 structure families (see Methods and Table 1). We found that HPs are the most probable structures formed by a random RNA sequence of this length. In fact, more than half of the sequences in our pools fold in structures

Programming and computational resources

Simulations were carried out on the Itanium II cluster of INTA (Instituto Nacional de Técnica Aeroespacial, Spain). For random number generation, we relied on the Mersenne Twister and Ziff's FSR4 algorithms as provided by the GNU Scientific Library (GSL), version 1.7 (see http://www.gnu.org/software/gsl). Although we fold $10^{8}$ molecules, this represents only a very small fraction of the sequence space, formed by $4^{35} \approx 10^{21}$ sequences. The probability of having a repeated sequence is of order $10^{-13}$,

Acknowledgements

The authors wish to acknowledge the technical assistance of Ruth Lobo and Pilar Viñado with the computations carried out at the Itanium II cluster of INTA, and the support of Isidro Cano and Hewlett-Packard within the project OriGenes.

Author contributions. All authors conceived and designed the study. MS performed the numeric calculations, SCM the analytic calculations. All authors analysed the data and wrote the paper.

Funding. This work was supported by Ministerio de Educación y Ciencia

References (43)

  • P.C. Anderson et al.

    Minimum sequence requirements for selective RNA-ligand binding: a molecular mechanics algorithm using molecular dynamics and free-energy techniques

    J. Comp. Chem.

    (2006)
  • R. Backofen et al.

    RNAs everywhere: genome-wide annotation of structured RNAs

    J. Exp. Zool. (Mol. Dev. Evol.)

    (2007)
  • D.P. Bartel et al.

    Isolation of new ribozymes from a large pool of random sequences

    Science

    (1993)
  • J.M. Carothers et al.

    Informational complexity and functional activity of RNA structures

    J. Am. Chem. Soc.

    (2004)
  • W. Fontana et al.

    Statistics of RNA secondary structures

    Biopolymers

    (1993)
  • H.H. Gan et al.

    Exploring the repertoire of RNA secondary motifs using graph theory; implications for RNA design

    Nucleic Acids Res.

    (2003)
  • J. Gevertz et al.

    In vitro RNA random pools are not structurally diverse: a computational analysis

    RNA

    (2005)
  • W. Grüner et al.

    Analysis of RNA sequence structure maps by exhaustive enumeration. I. Neutral networks

    Monatsh. Chem.

    (1996)
  • W. Grüner et al.

    Analysis of RNA sequence structure maps by exhaustive enumeration. II. Structures of neutral networks and shape space covering

    Monatsh. Chem.

    (1996)
  • R.R. Gutell et al.

    A story: unpaired adenosine bases in ribosomal RNAs

    J. Mol. Biol.

    (2000)
  • J. Hackermüller et al.

    The effect of RNA secondary structures on RNA-ligand binding and the modifier RNA mechanism: a quantitative model

    Gene

    (2005)
  • D.K. Hendrix et al.

    RNA structural motifs: building blocks of a modular biomolecule

    Quart. Rev. Biophys.

    (2005)
  • M. Hiller et al.

    Using RNA secondary structures to guide sequence motif finding towards single-stranded regions

    Nucleic Acids Res.

    (2006)
  • I.L. Hofacker et al.

    Fast folding and comparison of RNA secondary structures

    Monatsh. Chem.

    (1994)
  • I.L. Hofacker et al.

    Combinatorics of RNA secondary structures

    Discrete Appl. Math.

    (1998)
  • J.A. Jaeger et al.

    Improved predictions of secondary structures for RNA

    Proc. Natl Acad. Sci. USA

    (1989)
  • G.F. Joyce

    Directed evolution of nucleic acid enzymes

    Annu. Rev. Biochem.

    (2004)
  • T. Khanam et al.

    Poly(A)-binding protein binds to A-rich sequences via RNA-binding domains 1+2 and 3+4

    RNA Biol.

    (2006)
  • N. Kim et al.

    A computational proposal for designing structured RNA pools for in vitro selection of RNAs

    RNA

    (2007)
  • R. Knight et al.

    Finding specific RNA motifs: function in a zeptomole world?

    RNA

    (2003)
  • R. Knight et al.

    Abundance of correctly folded RNA motifs in sequence space calculated on computational grids

    Nucleic Acids Res.

    (2005)