Introduction

It is now accepted that many biologically active proteins do not have a unique 3-D structure as a whole or in part [15]. These intrinsically disordered proteins (IDPs) and intrinsically disordered protein regions (IDPRs) possess highly flexible structures and exist as dynamic conformational ensembles characterized by different degrees and depths of disorder [2, 4, 6–10]. IDPs/IDPRs are highly abundant in virtually any given proteome [1, 3, 5, 11]. Biological functions of IDPs, which are typically involved in regulation, signaling, and control pathways [12–14], represent a crucial complement to the functional repertoire of ordered proteins [15–18].

Intrinsic disorder was shown to be very common in RNA- and DNA-binding proteins [4, 8, 9, 19]. The results of the analysis of the Saccharomyces genome suggested that proteins containing disorder are over-represented in the cell’s nucleus and are likely to be involved in the regulation of transcription and cell signaling [3]. Systematic bioinformatics studies revealed a significant prevalence of intrinsic disorder in transcription factors [20–22]. For example, analysis of 401 human transcription factors showed that IDPRs occupy ~50 % of their entire sequence [22].

Multiple functions are associated with the RNA-binding proteins, which are believed to determine RNA fate from synthesis to decay [23]. For example, the intrinsically disordered C-terminal domain allows the La protein to interact productively with a diversity of noncoding RNA precursors, protect these RNAs from nucleases, and affect folding, maturation, and ribonucleoprotein assembly [24]. Other intrinsically disordered RNA-binding proteins often act as specific RNA chaperones, assisting in the structural rearrangements of RNA molecules [25]. Illustrative examples of such disordered RNA chaperones are viral core proteins from different Flaviviridae genera [26], bunyavirus nucleocapsid protein [27], hantavirus nucleocapsid protein [28], and potentially core proteins of Pestivirus [29].

In addition to the RNA chaperone activity, many RNA-binding proteins possess a multitude of intrinsic disorder-dependent functions. For example, serine/arginine-rich (SR) splicing factors, which play an important role during several steps of RNA metabolism and are involved in constitutive and alternative splicing, were shown to be IDPs [30]. Intrinsic disorder in a small RNA-binding protein, the HIV-1 transcriptional regulator Tat, is essential for viral gene expression and replication, as well as for the ability of Tat to interact with a large number of proteins within both infected and non-infected cells [31, 32]. The intrinsically disordered SARS-CoV nucleocapsid protein binds to the viral RNA genome, forms the ribonucleoprotein core, and is involved in several important functions in the viral life cycle [33]. Intrinsic disorder is used by the stem-loop binding protein (SLBP) for the regulation of histone mRNAs, since the disordered N-terminal domain of SLBP contains signals for mRNA translation and histone mRNA import [34]. Intrinsic disorder in SBP2, the SECIS Binding Protein 2 that specifically interacts with a stem-loop structure in the 3′ UTR RNA (the SECIS element), is important for the co-translational incorporation of selenocysteine into selenoproteins at a reprogrammed UGA codon [35].

The ribosome is a large ribonucleoprotein catalyzing protein translation. Although the ribosomes are responsible for the synthesis of proteins across all kingdoms of life, and although their core functions are mRNA decoding and catalysis of the peptide bond formation [36], other translation-related processes (such as initiation, termination, and regulation) are quite different in different domains of life [37, 38]. Since the eukaryotic ribosomes are directly involved in many eukaryote-specific cellular processes, they are at least 40 % larger than their bacterial counterparts due to the presence of additional ribosomal RNA (rRNA) elements called expansion segments and extra ribosomal proteins [39]. In prokaryotes, there are 70S ribosomes, with small and large subunits of 30S and 50S, respectively. The small 30S subunit contains a 16S ribosomal RNA (rRNA) and 21 proteins, whereas in the large 50S subunit there are two rRNAs (5S and 23S) and 31 proteins. The eukaryotic 80S ribosome consists of a small (40S) and a large (60S) subunit. In the 40S small subunit, there is a single 18S rRNA and 33 proteins. The eukaryotic 60S subunit is composed of three rRNAs (5S rRNA, 28S rRNA, and 5.8S rRNA) and 46 proteins [40]. Of the 79 eukaryotic ribosomal proteins, 32 have no homologs in the bacterial or archaeal ribosomes, and those that do have homologs possess long eukaryote-specific extensions [41].

Ribosomal proteins represent an interesting and important category of RNA-binding IDPs due to their unique functional and structural properties. In addition to being a crucial part of the ribosome, many ribosomal proteins are involved in translational regulation via binding to operator sites located on their own messenger RNA [42]. Based on the analysis of the crystal structures of the ribosome subunits, it was discovered that almost half of the ribosomal proteins have globular domains with long extensions that penetrate deeply into the ribosome particle’s core [43–50]. It was indicated that these extensions are disordered in solution but still play a key role in ribosomal assembly [49, 51–53]. In fact, the hypothesis is that the long basic extensions of ribosomal proteins (e.g., L3, L4, L13, L20, L22, and L24) can penetrate deeply into the ribosome subunit cores and undergo a disorder-to-order transition individually or co-fold with their RNA, thereby facilitating the proper rRNA folding [49]. It was also indicated that different extensions do not play a similar role in the assembly of the ribosome subunits in vivo and might have some other functions [49].

Although the fact that in their non-bound forms many ribosomal proteins are either completely disordered or contain long disordered regions has been known for a long time (e.g., ribosomal proteins were included in the early bioinformatics studies dedicated to the sequence peculiarities [4] and functional repertoire of IDPs [19]), the abundance and functional roles of intrinsic disorder in these proteins have never been the subject of focused large-scale bioinformatics analysis. Our study fills this gap by reporting the results of the bioinformatics analysis of 3,411 ribosomal proteins from 32 species. We show here that intrinsic disorder is very common among all the analyzed ribosomal proteins, that it has unique characteristics which differentiate it from the disorder in other RNA- and DNA-binding proteins, and that it plays a role in the various functions of these important RNA-binding proteins.

Materials and methods

Dataset of ribosomal proteins

We collected 3,438 proteins from the Ribosomal Protein Gene Database (RPG) [54] on Nov 7, 2011. This set includes proteins from 24 species in Eukaryota, four in Archaea, and four in Bacteria. We excluded 27 small peptides with <30 amino acids because they could not be predicted by MFDp [55]. The final dataset of 3,411 proteins, named RPG_3411, is summarized in Table S1.

Datasets of RNA- and DNA-binding proteins

We also collected a representative subset of RNA- and DNA-interacting proteins from a current release of UniProt [56] for the same set of species as in the RPG dataset. Next, for each species we selected at random a subset of RNA- and DNA-interacting proteins to match the number of ribosomal chains. The corresponding sets of DNA-binding, RNA-binding, and the ribosomal proteins are summarized in Table S1. This allowed us to represent a wide spectrum of the nucleic acid interacting chains while keeping the dataset sizes at a level that allows complete computational analysis. The combined set of RNA/DNA-binding chains includes 3,084 proteins; this number is slightly lower than the size of RPG set since some proteins interact with both RNA and DNA and a couple of species (Fusarium graminearum and Rhizopus oryzae) had fewer DNA/RNA–interacting proteins annotated in UniProt than the corresponding number of ribosomal chains in the RPG dataset.
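The per-species matching described above can be sketched as follows. This is a minimal illustration, not the authors' actual script: the function name `match_by_species`, the dictionary layout, and the fixed random seed are all assumptions introduced here. Where a species has fewer annotated binding proteins than ribosomal chains, the sketch simply takes all of them, which mirrors the explanation given for the slightly smaller combined set.

```python
import random


def match_by_species(ribosomal_counts, binding_by_species, seed=0):
    """For each species, draw at random as many nucleic-acid-binding
    proteins as there are ribosomal chains (hypothetical helper).

    ribosomal_counts:   {species: number of ribosomal chains}
    binding_by_species: {species: list of protein identifiers}
    """
    rng = random.Random(seed)  # fixed seed only for reproducibility
    selected = {}
    for species, n_ribosomal in ribosomal_counts.items():
        # Remove duplicate identifiers, e.g. proteins that bind both RNA and DNA.
        pool = sorted(set(binding_by_species.get(species, [])))
        # Take the whole pool if it is smaller than the ribosomal set.
        k = min(n_ribosomal, len(pool))
        selected[species] = rng.sample(pool, k)
    return selected
```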

Evaluation of the surface and interface areas

The solvent-accessible surface area (ASA) for all the ribosomal proteins of the eukaryotic ribosome (PDB ID: 3U5C and 3U5E [57]) was calculated using an in-house program based on the double cubic lattice algorithm [58] as implemented in the BALL library [59]. The ASA of a protein is calculated with a probe radius of 1.4 Å. The interface area buried by a complex is defined as the difference between the sum of the surface areas of the two partners and the surface area of the complex, where the indicated chain is considered as one partner and the remainder of the subunit (including the rRNA) is taken as the other partner: $\mathrm{ASA}_{\mathrm{interface}} = \mathrm{ASA}_{\mathrm{partner\,1}} + \mathrm{ASA}_{\mathrm{partner\,2}} - \mathrm{ASA}_{\mathrm{complex}}$. As observed by a reviewer, the ASA of the bound structures of IDPRs is not a measure of the ASA of free IDPRs. Nonetheless, these calculations are useful in distinguishing the unbound order/disorder state of components of a complex structure using Nussinov’s plot [60].

Nussinov’s plot

According to Gunasekaran et al. [60], the per-residue ASA versus per-residue interface ASA clearly distinguishes between the two classes of proteins, with monomers in the two-state complexes being characterized by extended shapes and larger interface areas, and with monomers in the three-state complexes being more globular and compact. See text below for an explanation of two-state and three-state complexes. In fact, in the per-residue ASA versus the per-residue interface ASA plot (Nussinov’s plot), the two-state and three-state complexes occupy very different areas. The disordered proteins (that form complexes in a two-state mechanism) are distributed sparsely over a broad area in the top-right part of the plot, suggesting that disordered proteins opt for extended shapes and larger interface areas. The ordered proteins (that form complexes in a three-state mechanism) are condensed in the small area at the bottom-left corner of the plot, suggesting that these proteins are more globular and compact in their bound form [60]. Furthermore, it was also pointed out that since the maxima of per-residue surface and interface areas for stable monomers lie around 80 Å², the line connecting these two extreme values in the per-residue surface area versus the per-residue interface area plot represents a natural boundary separating ordered and disordered proteins forming three-state and two-state complexes, respectively [60]. Here, ordered proteins were systematically located below this boundary, and the disordered proteins were widely spread above the boundary [60].
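The interface definition and the boundary test can be illustrated with a short sketch. It rests on one loudly stated assumption: the boundary is read here as the straight line joining the points (80 Å², 0) and (0, 80 Å²) in the per-residue surface versus per-residue interface plane, which is one interpretation of "the line connecting these two extreme values". The function names and the `max_area` parameter are hypothetical, and the ASA values themselves are assumed to come from an external surface calculation.

```python
def interface_asa(asa_partner1, asa_partner2, asa_complex):
    """Interface ASA = ASA(partner 1) + ASA(partner 2) - ASA(complex), in Å^2."""
    return asa_partner1 + asa_partner2 - asa_complex


def nussinov_side(surface_per_res, interface_per_res, max_area=80.0):
    """Report on which side of the ASSUMED boundary a chain falls.

    The boundary is taken as surface/max_area + interface/max_area = 1,
    i.e. the line through (max_area, 0) and (0, max_area). This reading of
    the boundary is an assumption, not a formula quoted from the text.
    """
    score = (surface_per_res + interface_per_res) / max_area
    return "above" if score > 1.0 else "below"
```

Under this reading, a chain with 90 Å² of surface and 40 Å² of interface per residue falls above the line (two-state-like), whereas a compact chain with 45 Å² and 10 Å² falls below it.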

Identification of likely disorder-to-order transition regions

The Nussinov plot is useful when the proteins of a complex are completely ordered or disordered, but can give ambiguous results when proteins contain both ordered and disordered regions. For these structures, a method to segment each protein of a complex into likely ordered and likely disordered segments would resolve the ambiguity. We base such a method on a principle similar to that used for the Nussinov plot: the complex structures of IDPRs will have a higher ASA than the structures of ordered regions. The idea behind the method is framed in terms of structural context: a residue with a low ASA is likely in a context in which it is folded, and a residue with a high ASA has likely been removed from a context in which it folds. In context (IC) and out of context (OC) residues were modeled using a discrete finite automaton (DFA) with two states. Each state is characterized by the emission probability distributions of the ASA of each residue type (alanine, cysteine, aspartic acid, etc.). The ASA distribution of IC residues was calculated directly from a sequence-unique set of 4,725 monomer X-ray structures from PDB. The ASA distribution of OC residues was estimated from the same set of structures, but considering only a short sequence window around each residue when calculating the ASA, i.e., the ASA of each residue is calculated out of the context of the monomer structure. ASA distributions were discretized using the method of Fayyad and Irani [61]. A window size of 11 was selected based on convergence of the IC and OC distributions with varying window size (data not shown). Transition probabilities for the DFA were selected to correspond to an average IC region length of 200 residues and an average OC region length of 20 residues. Classification of IC/OC was made by calculating the OC posterior probability using the forward/backward algorithm. For ribosomal proteins, posteriors were calculated from ASAs calculated on the isolated protein structures.
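A minimal sketch of the posterior calculation is given below, treating the two-state model as a hidden Markov model evaluated with the scaled forward/backward algorithm. Several points are assumptions made for illustration only: region lengths are taken to be geometrically distributed (so a mean length of L residues gives a leave probability of 1/L), the chain is started from its stationary distribution, and the per-residue emission probabilities `emis_ic` and `emis_oc` are supplied by the caller rather than derived from the 4,725-structure reference set.

```python
def oc_posterior(emis_ic, emis_oc, mean_ic_len=200.0, mean_oc_len=20.0):
    """Posterior probability of the OC state for every residue.

    emis_ic[t], emis_oc[t]: probability of the observed (discretized) ASA
    of residue t under the IC and OC emission distributions.
    """
    a = 1.0 / mean_ic_len          # P(IC -> OC), assumes geometric lengths
    b = 1.0 / mean_oc_len          # P(OC -> IC)
    trans = [[1.0 - a, a], [b, 1.0 - b]]
    start = [b / (a + b), a / (a + b)]   # stationary distribution (assumption)
    emis = list(zip(emis_ic, emis_oc))
    n = len(emis)

    # Scaled forward pass.
    fwd, scale = [], []
    for t in range(n):
        if t == 0:
            row = [start[s] * emis[0][s] for s in (0, 1)]
        else:
            prev = fwd[-1]
            row = [(prev[0] * trans[0][s] + prev[1] * trans[1][s]) * emis[t][s]
                   for s in (0, 1)]
        c = row[0] + row[1]
        fwd.append([row[0] / c, row[1] / c])
        scale.append(c)

    # Scaled backward pass.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        nxt = [emis[t + 1][s] * bwd[t + 1][s] for s in (0, 1)]
        bwd[t] = [(trans[s][0] * nxt[0] + trans[s][1] * nxt[1]) / scale[t + 1]
                  for s in (0, 1)]

    # Combine and normalize; index 1 is the OC state.
    post = []
    for f, bk in zip(fwd, bwd):
        total = f[0] * bk[0] + f[1] * bk[1]
        post.append(f[1] * bk[1] / total)
    return post
```

Residues whose returned value exceeds a chosen cutoff (for example 0.5, itself an assumption) would be labeled OC.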

Amino acid composition analysis

Amino acid compositional analysis was carried out using Composition Profiler [62] (http://www.cprofiler.org) using the PDB Select 25 [63] and the DisProt [64] datasets as reference for ordered and disordered proteins, respectively. Enrichment or depletion in each amino acid type was expressed as $(C_x - C_{\mathrm{order}})/C_{\mathrm{order}}$, i.e., the normalized excess of a given residue’s content in a query dataset ($C_x$) relative to the corresponding value in the dataset of ordered proteins ($C_{\mathrm{order}}$).
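The normalized excess can be computed directly from two sets of sequences, as in the sketch below. It reproduces only the fractional-difference formula; the function names are hypothetical, and the statistical evaluation that Composition Profiler performs is not covered.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences):
    """Fraction of each standard amino acid over a set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def fractional_difference(query_seqs, order_seqs):
    """(C_x - C_order) / C_order for every residue present in the reference."""
    c_x = composition(query_seqs)
    c_order = composition(order_seqs)
    return {aa: (c_x[aa] - c_order[aa]) / c_order[aa]
            for aa in AMINO_ACIDS if c_order[aa] > 0}
```

Positive values indicate enrichment in the query set relative to the ordered reference, and negative values indicate depletion.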

Search for potential globular domains in 3,438 ribosomal proteins

Potential globular domains in ribosomal proteins were identified using the GlobPlot server (http://globplot.embl.de/), which is a popular predictor based on a running sum of the propensity for amino acids to be in an ordered or disordered state [65]. GlobPlot is a computationally efficient Web service that allows the user to plot the tendency within the query protein for order/globularity and disorder [65] and was recently evaluated to provide competitive predictive performance [66].

Computational evaluation of disorder

Disorder was predicted with the MFDp method [55], which is a consensus-based predictor that was recently shown to provide strong and competitive predictive quality [67, 68]. MFDp predictions were used to calculate the disorder content (fraction of disordered residues), the number of disordered segments, and the number of long disordered segments that consist of at least 30 consecutive disordered amino acids; such long segments were found to be implicated in protein–protein recognition [69]. We only counted the disordered segments with at least four consecutive disordered residues, which is consistent with other reports [67, 70]. We also considered a given domain to be disordered if it includes at least one disordered region with at least four consecutive disordered residues, and to be significantly disordered if at least half of its residues are disordered.
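These per-chain statistics follow mechanically from a binary residue-level prediction. The sketch below assumes the prediction is available as a string of '1' (disordered) and '0' (ordered) characters, which is an illustrative input format rather than the native MFDp output, and the function name is hypothetical.

```python
from itertools import groupby


def disorder_summary(binary, min_seg=4, long_seg=30):
    """Disorder content and segment counts from a per-residue 0/1 string.

    Runs of '1' shorter than min_seg are not counted as segments, but their
    residues still contribute to the disorder content.
    """
    runs = [len(list(group)) for flag, group in groupby(binary) if flag == "1"]
    segments = [r for r in runs if r >= min_seg]
    return {
        "content": binary.count("1") / len(binary),
        "n_segments": len(segments),
        "n_long_segments": sum(1 for r in segments if r >= long_seg),
    }
```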

We also used the DisCon method [71] to predict the overall disorder content (fraction of the disordered residues) in the protein chains. DisCon provides more accurate disorder content predictions when compared with MFDp and several other recent disorder predictors [71], but, unlike MFDp, it does not predict disorder at the residue level. The residue-level predictions allow for a more insightful analysis, including an investigation into the number and size of the predicted disordered segments. In addition to DisCon, two binary disorder classifiers, charge-hydropathy (CH) plot [4, 72] and cumulative distribution function (CDF) plot [72, 73], as well as their combination known as CH–CDF analysis [73–75], were used.

Search for potential functional sites

We predicted the function of the disordered segments based on a local pairwise alignment against functionally annotated disordered segments collected from DisProt 5.9 [64]. We aligned each of the 7,548 disordered segments extracted from the RPG_3411 dataset against a set of 775 disordered segments collected from the DisProt database that have functional annotation. We calculated the alignments with the Smith–Waterman algorithm [76] using the EMBOSS implementation with default parameters (gap_open = 10, gap_extend = 0.5, and blosum62 matrix). We defined sequence similarity as the number of identical residues in the local alignment divided by the length of the local alignment or the length of the shorter of the two aligned segments, whichever is larger. We transferred the annotation if the similarity was >0.8; this means that some of the segments may be annotated with multiple functions. The value of the threshold was chosen to assume high similarity even in cases of alignment to a short segment, i.e., for the shortest segments of five residues at least four amino acids have to be matched. Consequently, we successfully annotated 911 disordered segments with 26 functions that are listed in Table S2. These annotations were used to discuss the difference in functional roles between short and long disordered segments in the ribosomal proteins.
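The similarity measure defined above reduces to a single ratio once an alignment has been computed. The sketch below takes the alignment statistics as plain numbers (the alignment itself would come from an external Smith–Waterman run); the function name and argument names are hypothetical.

```python
def segment_similarity(n_identical, alignment_len, len_a, len_b):
    """Identical residues divided by the larger of (a) the local alignment
    length and (b) the length of the shorter of the two segments."""
    denominator = max(alignment_len, min(len_a, len_b))
    return n_identical / denominator
```

For example, four identities in a four-residue local alignment between a five-residue segment and a 40-residue segment are normalized by five rather than four, so a short local match cannot reach a high score on its own. The annotation-transfer threshold is then applied to this value.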

We used the MoRFpred method [77], which is a leading predictor of molecular recognition features (MoRFs), to annotate MoRF regions. MoRFs are short (5–25 amino acids) disordered regions that undergo a disorder-to-order transition upon binding to protein partners and are implicated in signaling and regulatory functions [2, 78–80]. Following Mohan et al. [80], we grouped MoRF regions into α-MoRFs (that fold into α-helices), β-MoRFs (that fold into β-strands), γ-MoRFs (coils) and complex-MoRFs (mixture of different secondary structure), based on the secondary structure predicted with PSI-PRED [81].

Calculation of sequence conservation

We also report sequence conservation for the ordered residues, the disordered residues and the residues in long (with at least 30 consecutive disordered amino acids) disordered segments. The conservation was quantified with relative entropy [82] that was calculated from the Weighted Observed Percentages (WOP) profiles generated by PSI-BLAST [83]. PSI-BLAST was run with default parameters (-j 3, -h 0.001) against the NCBI’s non-redundant (nr) protein database, which was filtered using PFILT [84] to remove low-complexity regions, trans-membrane regions and coiled-coil regions. The use of the relative entropy is motivated by work in [82] that suggests that it leads to more biologically relevant results compared to some other conservation scores and the fact that it was recently applied to investigate disorder in histones [85] and to identify nucleotide-binding residues [86] and catalytic sites [87].
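For a single alignment column, relative entropy compares the observed residue distribution with a background distribution. The sketch below assumes the standard form, the sum over residues of p·ln(p/q), with natural logarithms and a caller-supplied background; the exact background frequencies and any pseudocount handling used in [82] are not specified here and are left as assumptions. The function name is hypothetical.

```python
import math


def relative_entropy(wop_column, background):
    """Relative entropy of one profile column against a background.

    wop_column: {amino acid: weighted observed percentage} for one position
    background: {amino acid: background frequency}, assumed to sum to 1
    """
    total = sum(wop_column.values())
    if total == 0:
        return 0.0  # empty column: no information (assumed convention)
    score = 0.0
    for aa, value in wop_column.items():
        p = value / total  # convert percentages to a probability
        if p > 0:
            score += p * math.log(p / background[aa])
    return score
```

A fully conserved column scores highest, while a column that matches the background scores zero.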

Results

Abundance of intrinsic disorder in ribosomal proteins as evidenced from the crystal structure of the eukaryotic ribosome

Bioinformatics analysis of the full-length ribosomal proteins from Saccharomyces cerevisiae

Figure 1a represents the results of the computational disassembly of protein components of the eukaryotic ribosome from the yeast S. cerevisiae and shows that the complex structure of this important nucleoprotein relies on the intrinsic disorder of ribosomal proteins. In fact, even simple visual inspection of the individual ribosomal proteins clearly shows that almost all of them possess very unusual shapes, which are not consistent with simple globular structure. These peculiar shapes suggest that many ribosomal proteins form the so-called two-state (or disordered) complexes, where the monomers unfold upon complex separation. Therefore, individual chains in such complexes are disordered in their unbound forms and fold at complex formation. This behavior is different from that of the so-called three-state (or ordered) complexes, individual chains of which are independently folded even in the unbound state [88, 89].

Fig. 1

a Computational disassembly of the eukaryotic ribosome from the yeast Saccharomyces cerevisiae (PDB ID: 3U5C and 3U5E; [57]). Structure of the proteinaceous component of the ribosome is shown at the center of the plot as a large complex, and structures of the individual ribosomal proteins are positioned around this central complex. The figure clearly shows that there are almost no ribosomal proteins with simple globular shape, and many of them contain long protrusions or extensions. b Plot of per-residue surface versus per-residue interface areas. Surface and interface area normalized by the number of residues in each chain for the ribosomal proteins were estimated as described in [60]. Proteins of the 40S and 60S subunits are shown by red and blue circles, respectively. A boundary separating ordered and disordered complexes is shown as a black dashed line

As mentioned above, Nussinov’s plot, where the per-residue surface area is plotted versus per-residue interface area for protein complexes, can distinguish between these two classes of proteins, with monomers in the two-state complexes being characterized by extended shapes and larger interface areas, and with monomers in the three-state complexes being more globular and compact [60]. In fact, the two-state and three-state complexes occupy very different areas in Nussinov’s plot. The disordered proteins (that form complexes in a two-state mechanism) are distributed sparsely over a broad area in the top-right part of the plot (above the boundary), suggesting that disordered proteins opt for extended shapes and larger interface areas. The ordered proteins (that form complexes in a three-state mechanism) are condensed in the small area at the bottom-left corner of the plot (below the boundary), suggesting that these proteins are more globular and compact in their bound form [60].

In agreement with these observations, Fig. 1b shows that almost all ribosomal proteins from the eukaryotic ribosome are located above the order–disorder boundary suggested by Gunasekaran et al. [60]. There are only two clear exceptions to this rule, the protein RACK1 found in the small ribosomal subunit and the ribosomal protein L11 of the large subunit. Five more proteins touch the boundary, with two proteins from the 60S subunit, L3 and L9, being located slightly below the line, and three proteins (L23-A, S1-A, and S12) being found right above the boundary. It is important to note here that although RACK1 is considered to be a component of the small (40S) ribosomal subunit of S. cerevisiae, it is not a typical ribosomal protein, being classified as a 40S-associated protein. In fact, RACK1 is the guanine nucleotide-binding protein subunit β-like protein, also known as the receptor of activated protein kinase C 1 (RACK1). This protein is located at the head of the 40S ribosomal subunit in the vicinity of the mRNA exit channel [90]. It acts as a scaffold protein recruiting some other proteins to the ribosome and is involved in the negative regulation of translation of a specific subset of proteins [90]. Since nearly all of the yeast ribosomal proteins are located above the boundary of Nussinov’s plot, these observations suggest that almost all of them belong to the category of proteins participating in the formation of two-state complexes. In other words, the vast majority of ribosomal proteins are mostly unstructured in their unbound state but fold to different degrees upon ribosome formation. In fact, the hypothesis on the mostly unfolded nature of unbound ribosomal proteins is in agreement with earlier experimental studies which showed that many individual ribosomal proteins do not possess ordered structure in their non-bound forms or at least contain long disordered regions [4, 91–104].
The conclusion on the different degree of folding in bound state follows from the visual inspection of protein structures shown in Fig. 1a suggesting that many ribosomal proteins are folded to different degrees and possess both globular and non-globular domains in their bound forms (see below for more detailed analysis of this phenomenon). Furthermore, analysis of the yeast ribosome crystal structure revealed that many ribosomal proteins contained long stretches of residues with missing electron density. These regions of missing electron density correspond to protein segments that retain high conformational flexibility in their bound forms precluding them from being detected in the crystallography experiments. Some of these regions with missing electron density, which can be found in REMARK 465: MISSING RESIDUES section of corresponding PDB entries, are (in the 40S subunit of the ribosome, PDB ID: 3U5C): residues 208–252 in S0-A, residues 1–19 and 334–355 in S1-A, residues 1–33 and 251–254 in S2, residues 226–240 in S3, residues 1–19 in S5, residues 227–236 in S6-A, residues 124–134 in S8-A, residues 187–197 in S9-A, residues 1–19 in S12, residues 1–10 in S14-A, residues 1–7 and 132–142 in S15, residues 90–94 and 127–136 in S17-A, residues 1–14 in S20, residues 1–35 and 106–107 in S25-A, residues 99–119 in S26-A, residues 1–81 in S31, residues 1–8 and 142–273 in suppressor protein STM1. In the 60S subunit of the yeast ribosome, PDB ID: 3U5E, the proteins with long regions of missing electron density are L6-A (residues 110–128), L7-A (residues 1–22), L8-A (residues 1–23), L10 (residues 103–111), L22-A (residues 1–8 and 109–121), L24-A (residues 99–155), L25 (residues 1–21), L30 (residues 1–8), L34-A (residues 114–121), and L40 (residues 1–76).
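Missing-residue lists such as the one above can be extracted from the REMARK 465 records of a PDB-format file. The sketch below is a simple, assumption-laden parser: it relies on the usual whitespace-separated layout of these records (optional model number, residue name, chain identifier, residue number), ignores insertion codes, and groups consecutive residue numbers into ranges per chain. The function name is hypothetical, and chain identifiers in the output are those of the PDB entry, not the protein names used in the text.

```python
import re
from collections import defaultdict

# Optional model number, residue name, one-character chain ID, residue number.
_REMARK_465 = re.compile(
    r"^REMARK 465\s+(?:\d+\s+)?([A-Z0-9]{1,3})\s+(\S)\s+(-?\d+)[A-Z]?\s*$"
)


def missing_residue_ranges(pdb_lines):
    """Return {chain: [(first, last), ...]} for residues listed in REMARK 465."""
    by_chain = defaultdict(list)
    for line in pdb_lines:
        match = _REMARK_465.match(line)
        if match:  # header and free-text REMARK 465 lines do not match
            by_chain[match.group(2)].append(int(match.group(3)))

    ranges = {}
    for chain, numbers in by_chain.items():
        numbers = sorted(set(numbers))
        spans, start, prev = [], numbers[0], numbers[0]
        for n in numbers[1:]:
            if n != prev + 1:          # gap: close the current range
                spans.append((start, prev))
                start = n
            prev = n
        spans.append((start, prev))
        ranges[chain] = spans
    return ranges
```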

Identification of likely disorder-to-order transitioning regions within the ribosomal proteins from S. cerevisiae

Visual analysis of individual ribosomal proteins in Fig. 1a reveals that many of these proteins have a structured (often globular) domain that might fold independently of binding to the rRNA or other ribosomal proteins and also possess long non-globular domains that are likewise used for interactions with binding partners. To find how this morphological heterogeneity might affect disorder-to-order transitions, we put together a statistical method for separating extended and collapsed regions based on accessible surface area analysis. The method is based on a discrete finite automaton (DFA) with two states, where one state is modeled on residues from intact proteins and the other is modeled on residues from local fragments (see “Materials and methods”). Each residue type is treated separately. The DFA analysis provided a probability that each residue is in/out of context (IC/OC; i.e., the probability that the residue is included or not included in globular structure), and all the ribosomal proteins were split into IC and OC residues. Figure 2a represents results of this analysis by showing all 60S proteins with the OC posterior probability mapped to ribbon radius and color. Here, color and width of ribbon correspond to the OC posterior probability, where regions with a high probability are red and wide and regions with a low probability are blue and thin. This figure agrees well with other data and shows that many ribosomal proteins have long regions with OC residues; i.e., regions not involved in globular structures. Next, we calculated Nussinov’s plot for each set of residues separately for each protein. The results of this analysis are shown in Fig. 2b, where data for IC and OC regions of 40S (circles) and 60S (squares) ribosomal proteins are shown by blue and red symbols, respectively. Figure 2b illustrates that all OC regions are clearly disordered in their unbound state and undergo binding-induced folding. Also, many globular domains are disordered when unbound.
Although many IC regions seem to be ordered prior to binding, the vast majority of points corresponding to these regions/domains are clustered in close proximity to the order–disorder boundary suggested by Gunasekaran et al. [60]. Therefore, the results of these analyses suggest that many ribosomal proteins are entirely disordered in the unbound form and a noticeable portion of their globular domains is formed as a result of binding to rRNA or other ribosomal proteins.

Fig. 2

Foldability of globular and extended domains of ribosomal proteins from the yeast Saccharomyces cerevisiae. a Worm representation of 60S proteins. Color and width of ribbon correspond to the OC posterior probability, where regions with a high probability are red and wide and regions with a low probability are blue and thin. b Nussinov’s plot of ΔASA against the ASA for the IC (blue) and OC (red) residues of 40S (circles) and 60S (squares) proteins

Contact order analysis of the ribosomal proteins from S. cerevisiae

Figure 3 represents the results of the contact order analysis of the conformations adopted by ribosomal proteins in their bound states. The contact order values were computed for proteins from the eukaryotic ribosome (PDB IDs: 3U5C and 3U5E) based on a recent definition of the residue–residue contacts [105], where two residues are assumed to be in contact if their Cβ atoms (except for G, where we use Cα atoms) are separated by <8 Å. The plot shown in Fig. 3 is an asymmetric bimodal distribution with the bigger peaks corresponding to the structures with lower contact order (in the ranges of 0.05–0.10 and 0.10–0.15 for the small and large ribosomal subunits, respectively) and much smaller peaks corresponding to the structures with relatively high contact order (in the range of 0.20–0.25). One should remember that low contact order values could be indicative of an elongated structure or low-density packing of residues in a globular structure. However, the analysis of structures of the eukaryotic ribosomal proteins with low contact order clearly shows that they possess highly extended structures (e.g., chains R and b of the 60S subunit and chains e and h of the 40S subunit) or have highly asymmetric hybrid structures containing relatively small globular domains and disproportionally long extended regions (chain f of the 40S ribosomal subunit). On the other hand, proteins with high contact order are characterized by the presence of large globular domains and short extended protrusions.
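A contact order calculation along these lines can be sketched as follows. The text does not spell out the formula, so this example assumes the commonly used relative contact order: the mean sequence separation of contacting residue pairs divided by the chain length. The input is one representative coordinate per residue (Cβ, or Cα for glycine), and the `min_separation` parameter and function name are assumptions introduced for illustration.

```python
import math


def relative_contact_order(coords, cutoff=8.0, min_separation=1):
    """Relative contact order from one coordinate per residue.

    coords: list of (x, y, z) tuples in Å, ordered along the sequence.
    Two residues are in contact if their representative atoms are closer
    than `cutoff`. Returns sum(|i - j|) / (L * N) over the N contacts,
    where L is the chain length (assumed standard definition).
    """
    n_res = len(coords)
    n_contacts = 0
    total_separation = 0
    for i in range(n_res):
        for j in range(i + min_separation, n_res):
            if math.dist(coords[i], coords[j]) < cutoff:
                n_contacts += 1
                total_separation += j - i
    if n_contacts == 0:
        return 0.0
    return total_separation / (n_res * n_contacts)
```

In a fully extended chain only sequence neighbors are in contact, so the value stays low, which matches the interpretation of low contact order as a signature of elongated structures.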

Fig. 3

Contact order values for proteins from the eukaryotic ribosome (PDB IDs: 3U5C and 3U5E). The figure includes three distributions of the contact order values: for all chains combined (black line), for 3U5C (green line), and for 3U5E (red line). The identifiers of the chains that have contact order values in a given interval are listed above the x-axis. Illustrative examples of structures of the ribosomal proteins with low contact order [chains e, f, and h in the crystal structure of the 40S subunit (PDB ID: 3U5C), and chains R and b in the crystal structure of the 60S subunit (PDB ID: 3U5E)] and the ribosomal proteins with relatively high contact order [chains U and c in the crystal structure of the 40S subunit (PDB ID: 3U5C), and chains c, d, f, and o in the crystal structure of the 60S subunit (PDB ID: 3U5E)] are shown on the sides of the plot

Some peculiarities of the amino acid compositions of ribosomal proteins

Amino acid compositions of the full-length ribosomal proteins

Analysis of the amino acid composition biases can provide interesting information on the nature of a protein. For example, the amino acid compositions of extended IDPs are characterized by some global biases, where low mean hydropathy is combined with high mean net charge. These global biases determine the highly unstructured and extended state of these proteins, since high net charge leads to strong electrostatic repulsion, and low hydropathy prevents efficient compaction [4]. In agreement with these global observations, IDPs were shown to be significantly depleted in so-called order-promoting amino acids, C, W, I, Y, F, L, H, V, and N, and substantially enriched in disorder-promoting residues, A, G, R, T, S, K, Q, E, and P [8, 15, 62, 106, 107]. We use a computational tool, Composition Profiler [62], to investigate the compositional biases in ribosomal proteins. This approach is based on the calculation of a normalized composition of a given protein or protein dataset in the (C s − C order)/C order form, where C s is a content of a given residue in a query (ribosomal) protein or dataset, and C order is the corresponding value for the set of ordered proteins from PDB Select 25 [63]. Figure 4a shows that, in comparison with typical ordered proteins, ribosomal proteins from all three domains of life are depleted in the major order-promoting amino acids, C, W, F, Y, L, V, H, and N, and are enriched in some disorder-promoting residues, particularly R, K, G (except for eukaryotic ribosomal proteins), A (except for archaeal ribosomal proteins), and E (except for eukaryotic ribosomal proteins). Obviously, the enrichment in positively charged R and K residues is determined by the functional need for the ribosomal proteins to interact with negatively charged rRNA. This high lysine-arginine content also defines the unusually high pI values reported for the majority of the ribosomal proteins (average pI ~ 10.1). 
Overall, the pronounced depletion in bulky hydrophobic and aromatic amino acids and the enrichment in polar and charged residues may define the low propensity of ribosomal proteins for autonomous (or partner-independent) folding. On the other hand, there are several interesting compositional biases for the ribosomal proteins that differentiate them from the typical IDPs. These biases include some enrichment in the order-promoting amino acids I and V, and the noticeable depletion in the content of disorder-promoting residues T, D, Q, and S.
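
The fractional difference used for the compositional profiles can be illustrated with a short sketch. This is not the Composition Profiler implementation; the function names and the toy sequences are hypothetical, and the sketch only evaluates the (C s − C order)/C order form for each of the 20 standard residues.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequences):
    """Fraction of each standard residue pooled over a set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def fractional_difference(query, reference):
    """(C_s - C_ref) / C_ref per residue; None when the reference lacks the residue."""
    c_s = composition(query)
    c_ref = composition(reference)
    return {
        aa: (c_s[aa] - c_ref[aa]) / c_ref[aa] if c_ref[aa] > 0 else None
        for aa in AMINO_ACIDS
    }

# Toy sequences invented for illustration; they are not taken from any dataset.
query = ["KKRRAG", "KRGA"]
reference = ["KALG", "RAVL"]
profile = fractional_difference(query, reference)
```

Positive values correspond to residues that are more abundant in the query set than in the reference set, and negative values to residues in which the query set is depleted. Replacing the reference set with ribosomal proteins gives the (C s − C ribosomal)/C ribosomal form.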

Fig. 4

Fractional difference in the amino acid composition between the different members of the family of ribosomal proteins from Bacteria (green bars), Archaea (red bars), and Eukaryota (yellow bars) and a set of completely ordered proteins calculated for each amino acid residue (compositional profiles). The fractional differences were evaluated for the full-length ribosomal proteins (a) and for extended (b) and globular domains (c). The fractional difference was calculated as (C x  − C order)/C order, where C x is the content of a given amino acid in a query set, and C order is the corresponding content in the dataset of fully ordered proteins. Composition profile of typical intrinsically disordered proteins from the DisProt database is shown for comparison (black bars). Positive bars correspond to residues found more abundantly in ribosomal proteins, whereas negative bars show residues, in which ribosomal proteins are depleted. Amino acid types were ranked according to their increasing disorder-promoting potential [15]. Panel d shows enrichment of amino acid M in the functions assumed by disordered regions that are considered in this work. We considered 26 functions from Table S2 that were annotated using DisProt database (as explained in the “Materials and methods” section); to assure statistically sound results 13 functions that have at least 20 annotated segments are shown. The fractional difference was calculated for M for the 13 functions that are sorted alphabetically on the x-axis. Positive bars correspond to function (disordered segments annotated with a given function) found with high counts of M while negative bars show functions where M is depleted. Panels e and f compare the amino acid compositions of the ribosomal, RNA- and DNA-binding proteins. 
In e, the fractional difference was calculated as (C x − C order)/C order, where C x is the content of a given amino acid in a query set, and C order is the corresponding content in the dataset of fully ordered proteins. In f, the compositions of the RNA- and DNA-binding proteins are compared with the general amino acid composition of the ribosomal proteins. Here, the normalized compositions of the RNA- and DNA-binding proteins are evaluated in the (C s − C ribosomal)/C ribosomal form, with C s being the content of a given residue in a dataset of the RNA- or DNA-binding proteins, and C ribosomal being the corresponding value for ribosomal proteins. In both plots, composition profiles of typical intrinsically disordered proteins from the DisProt database are shown for comparison (black bars)

Compositions of globular domains and extended regions

We analyzed peculiarities of the amino acid compositions of globular and disordered domains predicted using the GlobPlot server. Figure 4b shows that all non-globular regions of the ribosomal proteins clearly possess compositions typical for the IDPs/IDPRs, being enriched in major disorder-promoting residues and depleted in order-promoting residues. On the other hand, Fig. 4c illustrates that predicted globular domains possess amino acid biases consistent with the idea that they might contain a significant amount of disorder. In fact, in many respects, the composition profiles of globular domains resemble profiles calculated for the full-length ribosomal proteins. These domains are depleted in all order-promoting residues except for isoleucine and are enriched in some disorder-promoting residues (e.g., G, A, K, and E). Figure 4d provides further analysis of amino acid methionine that we found to be substantially enriched in extended regions (Fig. 4b) while being moderately depleted in globular domains (Fig. 4c). We studied the enrichment/depletion of this residue type over all segments with functional annotations (as explained in the “Materials and methods” section); we consider 13 functions that are possessed by at least 20 annotated sequences. We found that enrichment in methionine is associated with several functions carried out by disordered regions, such as polymerization, transactivation, autoregulation, regulation of apoptosis, and interactions with RNA and metals.

Overall characterization of the intrinsic disorder in ribosomal, RNA-, and DNA-binding proteins

Ribosomal proteins are important components of the ribonucleoprotein machine, the ribosome, where they specifically interact with rRNA and other ribosomal proteins. Therefore, it was interesting to compare the various behaviors of the ribosomal protein group (RPG) with those of general RNA- and DNA-binding proteins. To this end, representative sample sets of RNA- and DNA-binding proteins were assembled as described in the “Materials and methods” section and these three datasets were used in the subsequent studies.

Figure 4e and f compare the amino acid compositions of the ribosomal, RNA-binding, and DNA-binding proteins. In Fig. 4e, the normalized amino acid compositions of these three classes of nucleic acid-binding proteins are shown. Here, the normalized compositions were calculated as described above; i.e., in the (C s − C order)/C order form, where C s is the content of a given residue in a query dataset (IDPs, ribosomal, RNA- and DNA-binding proteins), and C order is the corresponding value for the set of ordered proteins from PDB Select 25 [63]. This figure shows that all nucleic acid-binding proteins are characterized by comparable depletion in order-promoting residues. As far as disorder-promoting residues are concerned, while the RNA- and DNA-binding proteins generally follow the trend typical for the IDPs, being moderately enriched in major disorder-promoting residues, the ribosomal proteins are quite different. Two major features stand out: substantial enrichment of the ribosomal proteins in R and K, compensated by noticeable depletion in D, Q, S, and E residues. To get a better understanding of the amino acid composition biases of the RNA- and DNA-binding proteins relative to the ribosomal proteins, we evaluated their normalized compositions in the (C s − C ribosomal)/C ribosomal form, with C s being the content of a given residue in a dataset of the RNA- or DNA-binding proteins, and C ribosomal being the corresponding value for ribosomal proteins. Results of this analysis are shown in Fig. 4f, which re-emphasizes the relative enrichment of the RNA- and DNA-binding proteins in N, D, Q, S, E, and P and their depletion in V, R, A, and K. Generally, data shown in Fig. 4e and f suggest that the RNA- and DNA-binding proteins are closer to each other than to the ribosomal proteins.

The average disorder content (i.e., the fraction of disordered residues) in the ribosomal protein group (RPG) ranges between 36 and 37.4 % across the three domains of life, see Fig. 5. This is substantially higher than the overall disorder content in various proteomes, which was estimated to be 18.9, 5.7, and 3.8 % for Eukaryota, Bacteria, and Archaea, respectively [3]. Our results indicate similar levels of disorder in the three domains of life and across the 32 considered species, with the lowest content at over 28 %. Figure 5 also shows that between 2.5 and 23.2 % of ribosomal proteins across the 32 species are fully disordered, with the largest average fraction (11.7 %) of fully disordered chains being found in the bacterial species.

Fig. 5

Disorder content (crosses and lines) and fraction of fully disordered proteins (black bars) in different species and domains of life for the ribosomal, DNA-, and RNA-binding proteins. The species, which are shown on the x-axis, are grouped into the Eukaryota, Archaea, and Bacteria domains

This behavior of ribosomal proteins is rather different from that of DNA- and RNA-binding proteins. In fact, disorder in DNA- and RNA-binding proteins is unevenly distributed among the three domains of life, with proteins from Eukaryota being substantially more disordered than corresponding proteins from Archaea and Bacteria. Interestingly, the overall disorder contents of eukaryotic ribosomal and RNA-binding proteins are rather similar (~37 and 41 %, respectively), whereas eukaryotic DNA-binding proteins possess more disorder (~60 %). However, in Archaea and Bacteria, the situation is reversed, and ribosomal proteins are more disordered than RNA- and DNA-binding proteins (see Fig. 5). Fully disordered eukaryotic ribosomal proteins are somewhat more abundant than fully disordered RNA-binding proteins and noticeably less abundant than fully disordered DNA-binding proteins. In Archaea and Bacteria, fully disordered chains are substantially more abundant among the ribosomal proteins than among the corresponding RNA- and DNA-binding proteins.

Figure S1 reveals that, on average, ribosomal proteins have between 1.4 (in Eukaryota) and ~1.5 (in Bacteria and Archaea) disordered segments per 100 residues (we normalize by unit of length to allow direct comparison to longer DNA- and RNA-binding chains), including 0.3–0.4 long disordered segments (>30 amino acids) per 100 residues. Therefore, according to all these parameters, ribosomal proteins are substantially more disordered than RNA- or DNA-binding proteins. This is an interesting observation since ribosomal proteins are typically significantly shorter than RNA- and DNA-binding proteins (see Figure S1).
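
The length-normalized measures discussed above can be sketched as follows. This is an illustration rather than the pipeline used in this work: the input is assumed to be a per-residue string in which 'D' marks a predicted disordered residue and 'O' an ordered one, the minimum segment length of four residues and the function name are assumptions, and the threshold for long segments is left as a parameter because the text describes it both as more than 30 and as at least 30 residues.

```python
import re

def disorder_stats(mask, min_segment=4, long_segment=30):
    """Summarize a per-residue disorder prediction ('D' disordered, 'O' ordered)."""
    n = len(mask)
    if n == 0:
        raise ValueError("empty prediction")
    # Lengths of all maximal runs of consecutive disordered residues.
    runs = [m.end() - m.start() for m in re.finditer(r"D+", mask)]
    segments = [length for length in runs if length >= min_segment]
    n_long = sum(length >= long_segment for length in segments)
    return {
        "disorder_content": mask.count("D") / n,      # fraction of disordered residues
        "fully_disordered": mask.count("D") == n,     # every residue is disordered
        "segments_per_100": 100.0 * len(segments) / n,
        "long_segments_per_100": 100.0 * n_long / n,
    }

# Toy 100-residue chain: one long and one short disordered segment.
example = disorder_stats("D" * 35 + "O" * 60 + "D" * 5)
```

Normalizing by chain length, as in the sketch, is what allows short ribosomal chains to be compared directly with the longer DNA- and RNA-binding chains.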

We further analyze the distribution of the disordered regions across chains of different lengths, see Fig. 6. While in Archaea the number of long disordered segments in ribosomal proteins increases linearly with the length of the protein chain, we observe an increased number of disordered segments for short chains in Eukaryota and Bacteria (see Fig. 6a). Furthermore, short (<100 amino acids) fully disordered ribosomal proteins are relatively common in Eukaryota and Bacteria, where about 1/3 of short chains are fully disordered. In contrast, Archaea have some longer fully disordered chains. This is due to the inclusion of Halobacterium salinarum (HAL), which has the highest disorder content (59.3 %); this stems from the fact that it has the largest fraction (23.2 %) of fully disordered proteins among all considered species, see Fig. 5. Overall, our analysis implies that small ribosomal proteins in Eukaryota and Bacteria are enriched in disorder when compared with the ribosomal proteins in Archaea. These behaviors are different from trends observed for the DNA- and RNA-binding proteins, which typically possess fewer disorder-related features than ribosomal proteins, except for the eukaryotic DNA-binding proteins, and whose disorder attributes decrease with the protein length (see Fig. 6b, c).

Fig. 6

The number of long disordered segments (30 or more residues) per protein (y-axis on the left; hollow points) and the fraction of fully disordered protein (y-axis on the right; solid bars) against protein length (x-axis) across the three domains of life in ribosomal (a), RNA- (b), and DNA-binding proteins (c)

Characterization of the domains in ribosomal, RNA-, and DNA-binding proteins

Application of the GlobPlot and MFDp tools to the set of 3,438 ribosomal proteins revealed that 412 proteins (12.0 %) were predicted without globular domains, 502 proteins (14.6 %) were predicted not to have disordered regions, whereas the remaining proteins were predicted to be hybrid proteins that contained both globular and disordered domains. Figure 7a shows that in all three kingdoms of life, most ribosomal proteins with globular domains are single-domain proteins (in ~60 % of proteins, >95 % of residues are included in a GlobPlot-predicted domain). However, more detailed analysis of globular domains using the MFDp tool showed that many of them contained disordered regions and that some are predicted to be entirely disordered (see Fig. 7b). Figure 7c shows that almost all globular domains contain at least one disordered region with at least four consecutive disordered residues, and ~20 % of domains were significantly disordered, with at least half of their residues being disordered.
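
The two criteria used to call a predicted globular domain disordered can be written compactly. The sketch below assumes that the per-residue disorder prediction for a domain is available as a string of 'D' (disordered) and 'O' (ordered) characters; the function name and the input encoding are hypothetical and do not describe the GlobPlot or MFDp output formats.

```python
def is_disordered_domain(mask, definition="def_1", min_run=4):
    """Apply one of the two domain-level disorder criteria to a 'D'/'O' string."""
    if not mask:
        raise ValueError("empty domain")
    if definition == "def_1":
        # At least one run of min_run (four) consecutive disordered residues.
        return "D" * min_run in mask
    if definition == "def_2":
        # At least half of the residues in the domain are disordered.
        return mask.count("D") >= len(mask) / 2
    raise ValueError(f"unknown definition: {definition}")

# Toy domains: the first satisfies only def_1, the second only def_2.
domain_a = "OODDDDOOOO"
domain_b = "DDDODDDOOD"
```

The two criteria are not nested: a domain with many scattered disordered residues can satisfy the second criterion without containing a run of four consecutive disordered residues.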

Fig. 7

Characterization of the globular domains in ribosomal proteins. Globular domains were predicted using the GlobPlot server (http://globplot.embl.de/). a The distribution of fraction of amino acids in domain per protein. b The distribution of disorder content per domain. c The fraction of disordered domains (hollow and solid circles, respectively; y-axis on the left) and the average length of disordered (red and orange bars) and ordered domains (dark and bright green bars; y-axis on the right). Domains were assumed to be disordered when they contain at least one disordered region with at least four consecutive disordered residues (def_1) or when at least half of their residues are disordered (def_2)

Figure 8 represents the results of CH–CDF analysis of ribosomal proteins and provides further support to their highly disordered nature. In this plot, the coordinates of each spot are calculated as a distance of the corresponding protein in the CH-plot (charge–hydropathy plot) from the boundary (Y-coordinate) and an average distance of the respective cumulative distribution function (CDF) curve from the CDF boundary (X-coordinate) [73–75]. The quadrants of CH–CDF phase space correspond to the following expectations: Q1, proteins predicted to be disordered by CH-plots, but ordered by CDFs; Q2, ordered proteins; Q3, proteins predicted to be disordered by CDFs, but compact by CH-plots (i.e., putative molten globules or proteins with alternating ordered and disordered regions); Q4, proteins predicted to be disordered by both methods (i.e., proteins with extended disorder). Although these classifications could be questionable for large, multidomain proteins, they provide a relatively unbiased description of ribosomal proteins, which are typically small proteins.
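
The quadrant assignment described above can be sketched as a simple sign test on the two distances. The sign convention is an assumption that is not stated in the text: here a positive CH distance is taken to mean disordered by the CH-plot, and a negative CDF distance to mean disordered by CDF analysis. Points lying exactly on a boundary are arbitrarily treated as ordered, and the function names are hypothetical.

```python
from collections import Counter

def ch_cdf_quadrant(cdf_distance, ch_distance):
    """Assign a protein to Q1-Q4 from its CDF (x) and CH (y) distances.

    Assumed convention: ch_distance > 0 means disordered by the CH-plot,
    cdf_distance < 0 means disordered by CDF analysis.
    """
    ch_disordered = ch_distance > 0
    cdf_disordered = cdf_distance < 0
    if ch_disordered and not cdf_disordered:
        return "Q1"  # disordered by CH-plot, ordered by CDF
    if not ch_disordered and not cdf_disordered:
        return "Q2"  # ordered by both methods
    if cdf_disordered and not ch_disordered:
        return "Q3"  # disordered by CDF, compact by CH-plot
    return "Q4"      # disordered by both methods

def quadrant_percentages(points):
    """Percentage of (cdf_distance, ch_distance) points falling in each quadrant."""
    counts = Counter(ch_cdf_quadrant(x, y) for x, y in points)
    return {q: 100.0 * counts[q] / len(points) for q in ("Q1", "Q2", "Q3", "Q4")}

# Toy coordinates invented for illustration: one point per quadrant.
toy = [(0.1, 0.2), (0.1, -0.2), (-0.1, -0.2), (-0.1, 0.2)]
toy_percentages = quadrant_percentages(toy)
```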

Fig. 8

a Evaluation of the abundance of intrinsic disorder in ribosomal proteins from the three domains of life, Bacteria (green circles), Archaea (red circles), and Eukaryota (yellow circles), in the form of a CH–CDF plot [74, 75]. b CH–CDF plot for archaeal ribosomal proteins that are split on globular (dark red) and non-globular domains (red). c CH–CDF plot for bacterial ribosomal proteins that are split on globular (dark green) and non-globular domains (green). d CH–CDF plot for eukaryotic ribosomal proteins that are split on globular (dark yellow) and non-globular domains (yellow)

Figure 8a shows that many full-length ribosomal proteins are predicted to be disordered as a whole, with >60 % of all ribosomal proteins being found in Q1, Q3, and Q4, and being therefore expected to behave as native molten globules, native coils, or native pre-molten globules in their unbound states. The distribution of archaeal, bacterial, and eukaryotic proteins between the four quadrants of the CH–CDF plot is as follows: Archaea, 9.2 % (Q1), 37.2 % (Q2), 17.6 % (Q3), and 36.0 % (Q4); Bacteria, 11.7 % (Q1), 35.5 % (Q2), 15.4 % (Q3), and 37.4 % (Q4); and Eukaryota, 17.1 % (Q1), 30.6 % (Q2), 14.1 % (Q3), and 38.2 % (Q4). Therefore, ribosomal proteins from different domains of life differ in their disorder propensities, and can be sorted as Archaea > Bacteria > Eukaryota by the fraction of ordered proteins in their Q2 quadrants. There is also an unusual bias in the number of ribosomal proteins populating Q1, which is typically considered a quadrant containing rare proteins [75]. In fact, our analysis shows that between 9 and 17 % of ribosomal proteins are found in Q1, whereas only 2.5 % of proteins from the entire mouse proteome are in this quadrant. Earlier, it was pointed out that Q1 proteins might have functions related to interaction with RNA, with four of the five distinctive GO terms found for these proteins dealing with RNA binding and modification [75]. By the CH analysis, these Q1 proteins are highly charged, and this feature may be related to their ability to interact with RNA [75].

Figure 8b, c, and d represent CH–CDF plots for globular and non-globular domains of ribosomal proteins from the three kingdoms of life. Results of this analysis are further summarized in Table S3, which shows that non-globular domains are systematically predicted to be mostly disordered and that many GlobPlot-identified globular domains are expected to be disordered. In fact, quadrants Q3 and Q4 of the CH–CDF plots that typically correspond to the disordered proteins/domains/regions contain 15.5 % (Q3) and 27.8 % (Q4) of predicted archaeal globular domains, 16.8 % (Q3) and 21.8 % (Q4) of predicted bacterial globular domains, and 10.8 % (Q3) and 32.3 % (Q4) of GlobPlot-predicted eukaryotic globular domains. Table S3 also shows that 21.7, 19.2, and 9.9 % of archaeal, bacterial, and eukaryotic ribosomal proteins were predicted to be devoid of globular domains.

Fig. 9

Distribution of the length of the disordered segments across the three domains of life of ribosomal proteins (a) and the corresponding cumulative distribution (b). Length distributions of corresponding ribosomal proteins (c) with its cumulative distribution (d)

All these data clearly show that intrinsic disorder is very common in ribosomal proteins from all three kingdoms of life.

Functional analysis of disordered segments in ribosomal proteins

Distributions of the sizes of the disordered segments in ribosomal proteins across the three domains of life are shown in Fig. 9a. Interestingly, we observe that the sizes follow a bimodal distribution with a relatively large number of short segments (between 4 and 15 amino acids) and with a second peak for longer fragments (between 25 and 100 amino acids). Figure 9c represents the overall ribosomal protein length distributions and shows that these proteins are relatively short, with an average length of about 100–150 residues.

Since intrinsically disordered regions have a bimodal length distribution, we analyze the function for two classes of the disordered segments: short segments with <30 amino acids, and long segments with at least 30 amino acids. For ribosomal proteins, we consider 26 functions, which are summarized in Table S2 and are annotated based on sequence alignment to the functionally characterized disordered segments from the DisProt database (as explained in the “Materials and methods” section). We exclude functions with <20 annotations for both short and long disordered segments.

Figure 10 compares the annotations of the 13 remaining predicted (using alignment) functions between the short and long disordered segments of ribosomal proteins. The results reveal that disorder in ribosomal proteins plays several important roles, from facilitating the protein–protein, protein–DNA, protein–RNA, and protein–other–ligand interactions, to involvement in metal binding, post-translational modifications, and implementation of linkers and intra-protein interactions. Overall, both long and short disordered segments are equally implicated in several functions, including interactions with proteins, DNA, and ligands. The short segments are predominant in a larger number of functions, including RNA and metal binding, auto-regulatory functions, transactivation, polymerization, apoptosis, and are more prevalent in the post-translational modification sites. At the same time, the long disordered segments more often serve as linkers and play a strong role in intra-protein interactions. Our analysis provides useful clues that can be used to narrow down potential functions of IDPs and IDPRs, especially knowing the size of the corresponding segments, in ribosomal chains that currently lack functional annotations.

Fig. 10

Fraction of short (4–30 amino acids) and long (over 30 amino acids) disordered segments for a given function; x-axis represents the 13 considered functions sorted by the decreasing number of short segments

The results of the predictions of potential binding sites were validated against the functions of known components. To this end, the predicted binding sites of proteins in the yeast ribosome were compared to the ribosome structure to determine whether regions predicted to be involved in binding of proteins and RNA actually perform these functions. Potential protein–protein interaction sites were predicted in 13 proteins that are found in the crystal structure of the yeast ribosome: S8-A (residues 119–150), S17-A (residues 1–5), S19-A (residues 1–5), S20 (residues 1–23), S26-A (residues 83–119), S27-A (residues 78–82), L4-A (residues 1–19), L10 (residues 217–220), L18-A (residues 140–186), L22-A (residues 101–121), L28 (residues 89–93), L31-A (residues 108–113), and L40-A (residues 35–38). An RNA-binding site was predicted in L31-A (residues 108–113). Analysis of the crystal structures of the yeast ribosomal subunits revealed that there is a reasonably good correlation between the predicted and actual binding sites, since many predicted protein–protein interaction sites of the yeast ribosomal proteins coincided with, overlapped, or were located in close proximity to the actual binding sites. For example, in the crystal structure of the small ribosomal subunit, residues 117 and 149–153 of S8-A are involved in interaction with S11-A; N-terminal residues 8, 12, 15–16 and 18–19 of S17-A interact with protein S3; residues 6–12 of S19-A are at the interface with S16-A; residues 25–29 of S20 bind to S3; S26-A interacts with S14-A and S2 via residues 42–71 and 59–70, respectively; S27-A is engaged in binding to S13 and S7-A via residue 82.
In the crystal structure of the 60S ribosome, residues 28–33 of L4-A protein interact with residues 123–133 of L18-A; the region containing residues 206–221 of L10 is at the interface with L5; besides being involved in interaction with L4-A, residues 164–172 of L18-A bind to L13-A; L28 binds to L13-A via the region containing residues 96–111; the interaction between L40-A and L9-A is secured by residues 77–91. The fact that the predicted binding sites of L22-A and L31-A were not involved in interaction with other ribosomal proteins does not necessarily mean that these predictions are wrong, since these regions (as well as predicted binding regions of other yeast ribosomal proteins) can be engaged in binding to non-ribosomal proteins.
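
The comparison of predicted and observed binding residues can be illustrated with a small sketch that reports the shared residues and, when there is no overlap, the smallest separation along the sequence. This is a simplification: it measures distance in residue numbers rather than in the three-dimensional structure, and the function names are hypothetical. The residue ranges are those quoted above for S8-A and S17-A.

```python
def expand(ranges):
    """Turn inclusive (start, end) residue ranges into a set of residue numbers."""
    return {r for start, end in ranges for r in range(start, end + 1)}

def compare_sites(predicted, observed):
    """Return (shared residues, smallest sequence gap between the two sites)."""
    pred, obs = expand(predicted), expand(observed)
    shared = sorted(pred & obs)
    gap = 0 if shared else min(abs(p - o) for p in pred for o in obs)
    return shared, gap

# S8-A: predicted 119-150; observed contacts with S11-A at 117 and 149-153.
s8 = compare_sites([(119, 150)], [(117, 117), (149, 153)])
# S17-A: predicted 1-5; observed contacts with S3 at 8, 12, 15-16, and 18-19.
s17 = compare_sites([(1, 5)], [(8, 8), (12, 12), (15, 16), (18, 19)])
```

In this representation the S8-A prediction overlaps the observed interface, whereas the S17-A prediction lies a few residues upstream of it, which corresponds to the "close proximity" case.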

MoRF regions in ribosomal, RNA-, and DNA-binding proteins

The most prevalent function of disorder in ribosomal proteins is facilitation of protein–protein interactions. Figure 10 shows that well over 30 % of the functionally annotated disordered segments in ribosomal proteins are implicated in these binding events. This motivates our analysis of MoRF regions [2, 78–80], which are defined as short disordered regions that undergo a disorder-to-order transition upon binding to protein partners and fold into mostly helical (α-MoRFs), strand (β-MoRFs), coil (ι-MoRFs), and complex (complex-MoRFs, which combine multiple secondary structure types) secondary structures. Figure 11a demonstrates that there are on average about 0.85 MoRFs per 100 residues (we normalize by unit of length to allow direct comparison to longer DNA- and RNA-binding chains) in eukaryotic ribosomal proteins, including a large fraction of α-MoRFs and ι-MoRFs and relatively lower numbers of complex- and β-MoRFs. The complex-MoRFs, ι-MoRFs, and α-MoRFs are similarly abundant in ribosomal chains from the three domains of life, while bacterial and archaeal ribosomal proteins are enriched in β-MoRFs. Both RNA- (Fig. 11b) and DNA-binding proteins (Fig. 11c) have fewer MoRF regions per 100 residues, and are characterized by rather different distributions of the overall abundance of MoRFs (which vary more widely between species) and their split into α-, β-, ι-, and complex-MoRFs between eukaryotic, archaeal, and bacterial proteins, particularly for DNA-binding chains that are depleted in β-MoRFs. This suggests that MoRF regions in the ribosomal chains may be involved in different types of protein–protein interactions across different domains.

Fig. 11

Number of MoRFs per protein, shown using stacked bars, across different species and domains. The bars are subdivided using colors that correspond to different MoRF types. The solid lines show a cumulative (over MoRF types located below the line) average number of a given MoRF type for each of the three domains. The species, which are shown on the x-axis, are grouped into Eukaryota, Archaea, and Bacteria domains. Plots a, b, and c correspond to ribosomal, and RNA- and DNA-binding proteins, respectively

Evolutionary conservation of disorder in ribosomal proteins

Next, we investigate evolutionary conservation of intrinsic disorder in ribosomal proteins. The conservation is quantified using the relative entropy computed from the Weighted Observed Percentages (WOP) profiles generated by PSI-BLAST (as explained in the “Materials and methods” section). Higher values of the relative entropy indicate a higher degree of conservation. Figure 12 shows that ribosomal, RNA-, and DNA-binding proteins in Bacteria are characterized by higher levels of conservation when compared with those in Archaea and Eukaryota. This can also be observed in Fig. 13, where we compare conservation between disordered and ordered residues. Besides the overall trend that shows higher conservation in Bacteria, our results show that disordered residues are more conserved when compared with the structured parts of the ribosomal proteins (see Fig. 13a). This is true for all species in Eukaryota and Archaea, while in Bacteria the disordered and ordered residues have similarly high conservation. Moreover, we show that residues located in long disordered segments of ribosomal proteins are more conserved than the overall population of both disordered and ordered amino acids across all three domains of life. In eukaryotic RNA-binding proteins, the situation is reversed and ordered regions are more conserved (Fig. 13b), whereas eukaryotic DNA-binding proteins are characterized by the higher conservation of long disordered and ordered regions (see Fig. 13c). This suggests that, from the evolutionary perspective, disorder plays an important role in all kingdoms of life, particularly in ribosomal proteins, where it is characterized by higher conservation levels.
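
A per-position relative entropy of the kind used here can be sketched as follows. The exact formulation is given in the “Materials and methods” section and is not reproduced here; the sketch assumes the common form, the sum over residue types of p·log2(p/q), where p is the observed fraction at an alignment position and q is a background fraction. The base-2 logarithm, the uniform background, and the function name are assumptions made only for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def relative_entropy(column, background):
    """Relative entropy of one profile column against background frequencies.

    column: residue -> weighted observed percentage (any positive scale).
    background: residue -> background fraction (assumed, must be > 0).
    """
    total = sum(column.values())
    if total <= 0:
        raise ValueError("empty profile column")
    score = 0.0
    for aa, value in column.items():
        p = value / total
        if p > 0:  # terms with p == 0 contribute nothing
            score += p * math.log2(p / background[aa])
    return score

# Uniform background, used here only to keep the example self-contained.
uniform = {aa: 1 / 20 for aa in AMINO_ACIDS}
conserved = {aa: (100.0 if aa == "K" else 0.0) for aa in AMINO_ACIDS}
variable = {aa: 5.0 for aa in AMINO_ACIDS}
re_conserved = relative_entropy(conserved, uniform)
re_variable = relative_entropy(variable, uniform)
```

A fully conserved position gives the largest value attainable for the chosen background, while a position that matches the background gives zero, which is consistent with higher relative entropy indicating stronger conservation.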

Fig. 12

Distribution of the average relative entropy, which quantifies evolutionary conservation, for the proteins from Eukaryota, Archaea, and Bacteria. Plots a, b, and c correspond to ribosomal, and RNA- and DNA-binding proteins, respectively

Fig. 13

The average relative entropy that quantifies evolutionary conservation across different species and domains. Blue points/lines, green triangles/lines, and orange crosses/lines denote the average relative entropy of disordered residues in long disordered segments, all disordered residues, and ordered residues, respectively. The species, which are shown on the x-axis, are grouped into Eukaryota, Archaea, and Bacteria domains. Plots a, b, and c correspond to ribosomal, RNA- and DNA-binding proteins, respectively

Orthology and disorder in ribosomal proteins

Using a representative organism from each kingdom of life (H. sapiens, E. coli, and S. tokodaii) we annotated proteins for all pairs of the selected species as either orthologous or non-orthologous using the data available in RPG database [54]. The selected bacterial and archaeal species have the largest proteomes in their respective sets of species. The overall disorder content in the three species and the content for their orthologous or non-orthologous proteins is summarized in Fig. 14. We observe that the orthologous chains are characterized by lower amounts of disorder compared to the amount of disorder for the corresponding non-orthologous proteins. This trend is true across all three proteomes, which suggests that disorder may play a role in specializing and adjusting the ribosome for a particular kingdom of life.

Fig. 14

Comparison of disorder content between orthologous (green bars) and non-orthologous (red bars) proteins across all pairs of the selected species from the three kingdoms of life, including H. sapiens (HOMO), E. coli (ECO), and S. tokodaii (SUL). The hollow bars denote the overall, for a given species, disorder content. The numbers above the bars indicate the corresponding count of orthologous and non-orthologous chains

Discussion

Commonness and peculiarities of intrinsic disorder in the ribosomal proteins

We show in this study that intrinsic disorder is widespread among the ribosomal proteins from all three kingdoms of life. This conclusion is in line with the results of the analysis of the crystal structure of the eukaryotic ribosome from the yeast S. cerevisiae, which revealed that many ribosomal proteins contain regions of intrinsic disorder, which are seen as regions with missing electron density [57]. Many ribosomal proteins contain IDPRs that are at least 8 residues long, with some IDPRs being as long as 94 residues. Illustrative examples of such proteins are listed in Supplementary Materials. We also point out that many of the eukaryotic core proteins contain eukaryote-specific extensions that interact with the rRNA expansion segments in the 60S subunit. For example, the conserved proteins that are associated with the polypeptide exit tunnel, L22, L4, L23, and L29, all contain very long extensions, up to 140 Å in the case of L4, that reach the periphery of the 60S subunit [57]. Another protein with a very unusual configuration is L24e, whose N-terminal domain resides in the 60S subunit whereas the C-terminal domain reaches to the back of the 40S subunit due to the presence of a long flexible linker that protrudes deep into the side of the 40S body [57].

Visual analysis of the crystal structures of individual ribosomal proteins revealed that many of them possess very unusual morphologies inconsistent with simple globular structures suggesting that these structures are likely to be formed as a result of binding-induced folding (see Fig. 1a). This hypothesis is supported by the computational analysis of these structures in the form of Nussinov’s plot, where the vast majority of eukaryotic ribosomal proteins is found above the order–disorder boundary suggested by Gunasekaran et al. [60]. In order to understand whether the globular domains seen for many ribosomal proteins are independent folding units or are formed due to the binding-induced disorder-order transitions, we developed a tool (discrete finite automaton, DFA) to computationally separate proteins with known 3D-structure into globular domains and non-globular parts. The subsequent Nussinov’s plot analysis showed that many globular domains were formed due to binding to other components of the ribosome (Fig. 2). These findings provide very important support to the hypothesis that many eukaryotic ribosomal proteins are mostly disordered in their unbound states.

To understand how general this statement is, we next analyzed a large dataset of ribosomal proteins from all kingdoms of life. Application of various computational tools unequivocally showed that disorder is very common in all the ribosomal proteins and that many potential globular domains still possess noticeable levels of disorder (see Figs. 4, 5, 6, 7, 8). Since disorder is reliably predicted using computational tools developed based on the disorder-related data from large databases (e.g., PDB), one can conclude that disordered regions of ribosomal proteins are generally similar in their properties to disordered regions of many other proteins observed in several large databanks.

The ribosome is a ribonucleoprotein machine whose proteins are involved in interactions with both proteins and RNA. To understand how ribosomal proteins differ from other nucleic acid binding proteins, we compared some of their disorder-related features with disorder characteristics of large randomly selected sets of RNA- and DNA-binding proteins. Data shown in Figs. 4, 5, 9, 11, 12, and 13 suggest that disorder in ribosomal proteins, especially its functional roles, and evolutionary features are different from those aspects of disorder in DNA- and RNA-binding proteins. It is likely that some of these differences are related to the functional uniqueness of ribosomal proteins, many of which are involved in multiple simultaneous binding events, being involved in interactions, not only with RNA, but also with neighboring proteins. Some of the reasons for the abundance of disorder in ribosomal proteins are considered in the next several paragraphs.

Why is intrinsic disorder so common in the ribosomal proteins?

Functional viewpoint: protein–rRNA and protein–protein interactions on the ribosome

Being components of a large ribonucleoprotein complex, ribosomal proteins are obviously involved in interactions with both RNA and other proteins. Their ability to bind to RNA is determined by their high positive charge. In general, ribosomal proteins are very basic (average pI ~ 10.1), suggesting that a general function of these proteins may be to counteract the negative charges of the phosphate residues in the rRNA backbone. In agreement with this hypothesis, many ribosomal proteins were shown to serve as RNA chaperones and therefore play crucial roles during the ribosome assembly [108, 109]. The only exceptions to this rule are S1 and S6 in the small subunit and the L7/L12 proteins in the large subunit, none of which have significant contacts with RNA, being predominantly engaged in protein–protein interactions. Here, L7/L12 interact directly with L10 to form the pentameric L10 × (L7/L12)4 or heptameric L10 × (L7/L12)6 complex, S6 makes extensive contact with S18, and S1 interacts with S21, S11, and S18 [109].

Many ribosomal proteins possess complex structures and are often characterized by a tadpole-like shape (see Fig. 1) containing a globular domain, which is generally located on the surface of the ribosome, and a long extended region that penetrates into the ribosome’s interior. In fact, all S-proteins (except S4 and S15) and about 50 % of the L-proteins possess such extensions, which have distinctive amino acid compositions, containing multiple Gly residues to allow flexibility and tight packing, and which are rich in basic amino acids to interact with rRNA [109]. In fact, the content of the basic amino acids Arg/Lys in the extensions of the large subunit ribosomal proteins (27 %) noticeably exceeds that of the globular parts (19 %). As a result, these extensions that constitute only ~20 % of the protein mass of the large subunit are responsible for burying of ~50 % of total RNA surface area [109]. It was pointed out that some ribosomal proteins, when studied in isolation, contain globular regions, whereas their extended tails are typically not observed in the isolated structures [109], suggesting that these regions undergo disorder-to-order transitions induced by interaction with rRNA. Among the most extreme examples are the long extensions of L2 and L3 that reach towards the peptidyl-transferase center. S12 has an extremely long extension that starts from the globular domain located adjacent to the decoding center on the intersubunit side of the small subunit and reaches all the way to the back or solvent side of the 30S subunit where it interacts with S8 and S17. Thus, S12 provides an illustrative example of the “penetrator” binding mode, where a significant part of an IDP penetrates deep inside the structure of its binding partner [110]. Also, the 61 amino acid ribosomal protein S14 is completely devoid of any globular domain [109]. 
Therefore, IDPRs of many ribosomal proteins are important foldable regions that serve to ensure the formation of a correctly folded rRNA state during the ribosome assembly process and also support the correct conformation of the rRNA in the final assembled complex [109].
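The comparison of Arg/Lys content between extensions and globular parts quoted above (27 % versus 19 %) is a simple composition count. A minimal sketch of such a count is given below; the function name and both sequences are hypothetical and serve only to show the calculation, they are not real ribosomal protein fragments and do not reproduce the published values.

```python
def basic_fraction(sequence: str, basic: str = "RK") -> float:
    """Fraction of residues in `sequence` that are Arg (R) or Lys (K)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return sum(seq.count(aa) for aa in basic) / len(seq)


# Hypothetical toy sequences chosen for illustration only.
extension = "GRKGRAKGGRKPSGKR"          # Gly- and Arg/Lys-rich, extension-like
globular = "MAVLKDEIAFTRLESGQWVNKLDP"   # mixed composition, domain-like

ext_frac = basic_fraction(extension)
glob_frac = basic_fraction(globular)
print(f"extension: {ext_frac:.2f}, globular: {glob_frac:.2f}")
```

Applied to real data, the same count would be run separately over the residues assigned to extensions and to globular domains, which requires domain boundaries that are not given in this section.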

Besides the aforementioned intensive contacts with rRNA, several ribosomal proteins are involved in a well-developed network of protein–protein interactions. For example, a tight heterodimeric complex is formed by S6 and S18 proteins on the outer edge of the platform of the small subunit, whereas at the back of the 30S head, S3, S10, and S14 form a tight complex, and in the large subunit there are previously mentioned pentameric L10 × (L7/L12)4 or heptameric L10 × (L7/L12)6 protein complexes [109]. Formation of these tight protein–protein complexes may also involve disorder-to-order transition, at least in some parts of the interacting proteins.

Functional viewpoint: specific on-ribosome functions

It was recognized long ago that some ribosomal proteins are essential mostly for the assembly of the ribonucleoprotein particle and are dispensable for function after the ribosomal subunits are fully assembled [111], suggesting that the major function of these “dispensable” proteins (e.g., S16, L15, L16, L20, and L24) in the assembled ribosome could be to improve the ribosome stability. Furthermore, several ribosomal proteins are not essential for the translational function of the ribosome, a hypothesis based on the observation that E. coli strains lacking S6, S9, S13, S17, S20, L1, L9, L11, L15, L19, L24, L27 to L30, and L33 are viable [109, 112, 113]. Since the on-ribosome functions of the ribosomal proteins were covered in a recent in-depth review [109], we simply list some of these functions in the Supplementary Materials. Interested readers are encouraged to consult the original review, where the functional roles of many ribosomal proteins are considered in great detail [109].

All of these functions rely on multiple interactions with various partners, suggesting that ribosomal proteins can be considered ribosomal hubs. Earlier, it was shown that the binding promiscuity of hubs benefits from intrinsic disorder in one of two ways: one disordered region can bind to many different partners, or many disordered regions can bind to one partner [13, 114–119].

Functional viewpoint: moonlighting or off-ribosome functions

The core ribosome functions, i.e., the precise interaction of the mRNA codon with the tRNA anticodon and the catalysis of peptide bond formation, are carried out by rRNA molecules of the small and the large ribosomal subunits, respectively. Therefore, the major or core on-ribosome functions of ribosomal proteins are to assist in rRNA folding (i.e., to serve as RNA chaperones) and function, to assist in the ribosome assembly, and to be involved in related protein–protein, protein–rRNA, protein–mRNA, and protein–tRNA interactions. On the other hand, many ribosomal proteins have been shown to be involved in extra-ribosomal or auxiliary functions, i.e., in moonlighting activities. Indeed, numerous extra-ribosomal functions have been assigned to ribosomal proteins [120–124]. It was even stated recently that “moonlighting is particularly widespread among ribosomal proteins, many of which have extra-ribosomal employment” [122]. Even the first systematic analysis of this subject (which was performed in 1996) revealed that ribosomal proteins might have up to 30 extra-ribosomal functions [120]. Recently, it was emphasized that the numerous extra-ribosomal functions of ribosomal proteins reported in the literature so far can be grouped into two major categories: (1) control of the balance among the ribosomal components; and (2) control of nucleolar stress or aberrant ribosomal synthesis, leading to cell cycle arrest or apoptosis [124]. Some of the extra-ribosomal functions of ribosomal proteins within the ribosome system were already described above (e.g., see notes for S1, L1, and L4) and are covered in great detail in a recent review [124]. In E. coli, these extra-ribosomal activities include the L4-mediated inhibition of translation of the S10 operon, which encodes 11 different ribosomal proteins including L4 itself [42], and the binding of L4 to RNase E, which modulates the activity of this enzyme and results in stress-related changes in the mRNA composition [125]. It was emphasized that, among other regulatory ribosomal proteins, L4 occupies a unique position due to its ability to regulate both the transcription and the translation of the L4-mRNA-containing S10 operon [126–128]. Furthermore, via a comprehensive analysis of deletion and point mutants, these two functions of L4 were assigned to different regions of this protein [129]. In fact, although the C-terminal region of L4 (residues 171–201) was shown to be crucial for the L4-mediated autogenous control, it was not involved in the incorporation of this protein into the ribosome. On the other hand, the central region of L4 (residues 67–103) was involved in the ribosome assembly but did not play a significant role in the regulatory L4 functions [129]. Curiously, the last third of the regulatory C-terminal fragment of L4 is predicted to be highly disordered, whereas the central region required for the ribosome assembly is expected to be mostly disordered throughout its entire length.

In Eukaryotes, L30 inhibits splicing by binding to its own transcript [130], S14 controls the splicing of the transcript of one of its genes [131], L2 controls the level of its mRNA through accelerated turnover [132], S13 binds to the first intron of its transcript to inhibit splicing [133, 134], and L12 controls its own synthesis by inhibiting the splicing of its own mRNA [135]. In addition to these roles in the control of the balance among ribosomal components during the ribosome synthesis, the established off-ribosome functions of ribosomal proteins are related to the surveillance of the ribosome assembly, as well as numerous roles in development, apoptosis, and cancer [124]. It is very likely that the ability of ribosomal proteins to act off the ribosome can be attributed to their intrinsically disordered nature. This hypothesis is in agreement with a recent analysis that showed that the structural malleability characteristic for the IDPs/IDPRs is strongly associated with the ability of proteins to moonlight [121].

Evolutionary viewpoint

Ribosomes are intricate subjects for evolutionary analysis, since they are found in all living cells, where they are absolutely necessary for protein biosynthesis. It was pointed out that although ribosomal proteins are generally highly conserved within the different domains of life, there is a noticeable difference between the ribosomal proteins of Bacteria, Archaea, and Eukaryota [41, 109]. In fact, only ~30 % of bacterial, eukaryotic, and archaeal ribosomal proteins are considered to be orthologous. An additional 30 % of the ribosomal proteins are common to the archaeal and eukaryotic ribosomes. However, no proteins are exclusively common to the bacterial and archaeal ribosomes or to the bacterial and eukaryotic ribosomes, thus supporting the theory that the common ancestor of Archaea and Eukarya separated from the Bacteria before the Archaea and Eukarya separated from each other [41, 109]. The high sequence conservation detected in several ribosomal proteins (especially those critical for ribosomal function and assembly) indicates their functional importance.

On the other hand, the ribosomes with their unique ribozymatic activities support the validity of the “RNA world” theory, according to which the biosphere once was dominated by organisms in which RNA was used for both information storage and catalysis [136]. Based on this hypothesis and on the assumption that during the evolution of enzymatic activity, catalysis was transferred from RNA to ribonucleoprotein to protein, it was proposed that the first proteins to come into being were RNA chaperones [137, 138]. In fact, it is rather obvious that the first proteins should be short and unfolded polypeptides [139], since the chance for the spontaneous appearance of a polypeptide chain capable of folding into a unique 3-D structure is extremely low. Furthermore, the first biological functions of these disordered primordial polypeptides are also obvious—they have to be involved in interactions with ribozymes to stabilize their unstable and prone-to-misfold structure. In fact, it is well known that the single-stranded RNAs are flexible macromolecules and can fold into a wide variety of alternative conformations. However, for a given ribozyme, only one given conformation is functionally relevant. Therefore, in order for a given RNA to reach the biologically relevant conformation and not be trapped in one of the many structurally available but functionally incorrect structures, a special mechanism for assisted RNA folding should be implemented [140]. Currently, this special mechanism mostly relies on RNA chaperone proteins [140]. Therefore, it is reasonable to hypothesize that ancient polypeptides would serve as first RNA chaperones, which via their interactions with primordial RNAs would assist in productive folding of the ancient ribozymes and also would stabilize the biologically active structures of those ribozymes. 
Since many ribosomal proteins are intrinsically disordered RNA chaperones, the ribosome can clearly be considered as a living fossil that represents a snapshot of one of the early stages of prehistoric development.

In conclusion, this paper presents the results of comprehensive computational analyses of ribosomal proteins and shows that the vast majority of these important RNA-binding proteins are IDPs with ribosome-specific sequence features. We also show that intrinsic disorder is very important for various biological functions of ribosomal proteins, being commonly used in numerous interactions of any given ribosomal protein with its various binding partners of different nature, such as other ribosomal proteins, RNA, and proteins from the translational machinery. The intrinsically disordered nature of ribosomal proteins is highly conserved in different domains of life, indicating that the lack of rigid structure, the resulting ability of ribosomal proteins to interact with various binding partners and be involved in the wide spectrum of moonlighting activities represent a strong evolutionary advantage. Therefore, careful consideration and appreciation of intrinsic disorder are crucial for better understanding of structure and conformational behavior of ribosomal proteins, their promiscuity, molecular mechanisms of their numerous extra-ribosomal functions, and mechanisms underlying regulation and control of these very important proteins.