The scale-free nature of protein sequence space

Patrick C. F. Buchholz; Catharina Zeil; Jürgen Pleiss

doi:10.1371/journal.pone.0200815

Abstract

The sequence space of five protein superfamilies was investigated by constructing sequence networks. The nodes represent individual sequences, and two nodes are connected by an edge if the global sequence identity of two sequences exceeds a threshold. The networks were characterized by their degree distribution (number of nodes with a given number of neighbors) and by their fractal network dimension. Although the five protein families differed in sequence length, fold, and domain arrangement, their network properties were similar. The fractal network dimension D_f was distance-dependent: a high dimension for single and double mutants (D_f = 4.0), which dropped to D_f = 0.7–1.0 at 90% sequence identity, and increased to D_f = 3.5–4.5 below 70% sequence identity. The distance dependency of the network dimension is consistent with evolutionary constraints for functional proteins. While random single and double mutations often result in a functional protein, the accumulation of more than ten mutations is dominated by epistasis. The networks of the five protein families were highly inhomogeneous with few highly connected communities ("hub sequences") and a large number of smaller and less connected communities. The degree distributions followed a power-law distribution with similar scaling exponents close to 1. Because the hub sequences have a large number of functional neighbors, they are expected to be robust toward possible deleterious effects of mutations. Because of their robustness, hub sequences have the potential of high innovability, with additional mutations readily inducing new functions. Therefore, they form hotspots of evolution and are promising candidates as starting points for directed evolution experiments in biotechnology.

Citation: Buchholz PCF, Zeil C, Pleiss J (2018) The scale-free nature of protein sequence space. PLoS ONE 13(8): e0200815. https://doi.org/10.1371/journal.pone.0200815

Editor: Dante R. Chialvo, Consejo Nacional de Investigaciones Cientificas y Tecnicas, ARGENTINA

Received: April 4, 2018; Accepted: July 3, 2018; Published: August 1, 2018

Copyright: © 2018 Buchholz et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: This work was funded by the Deutsche Forschungsgemeinschaft (FOR 1296 (JP), EXC 310 (CZ), PL145/16-1 (PCFB). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Power laws of the form f(x) ~ x^γ are ubiquitous in many physical systems and describe scale free phenomena, for which changing the scale of the independent variable x preserves the functional form f of the solution (f(λx) = λ^γ f(x)) [1]. Because scaling is a manifestation of the dynamics and geometry of a physical system, scaling laws reflect underlying generic features and provide insight into important universal principles, characterized by the scaling exponent γ.

Power laws also play an important role in life sciences. Spanning many orders of magnitude, fundamental variables such as metabolic rate, growth rate, or tree height follow a power law with an exponent γ which is an integer multiple of ¼ [2]. The observation of scaling relationships throughout the living world has inspired the search for basic principles that explain complex biological phenomena from unicellular organisms to trees. Power laws also describe population genetics for unlinked loci in the monomorphic limit and are a consequence of Darwin's theory of evolution [3]. For proteins, scaling relations were observed for the solvent-accessible surface area [4], the packing [5], and the equilibrium dynamics [6], and it has been suggested that near-criticality might be a characteristic of biological systems [7].

Power law distributions have also been detected in sequence similarity networks of proteins [8] and have been interpreted as a consequence of evolution [9,10] or the constraints of protein structure [9,11]. Detailed modelling of protein evolution is challenging due to the high complexity of combining random genotypic variation with selection of phenotypic traits such as folding pathway, protein stability, and biological function of the protein. Usually, the effects of mutations are non-additive and dominated by epistasis [12]. Moreover, only an infinitesimally small fraction of the sequence space of proteins has been inspected yet, despite the rapidly growing amount of DNA data due to advances in DNA sequencing techniques. While we currently know the sequences of 10⁸ proteins [13], the number of different protein sequences existing in the biosphere was estimated to be 10³⁴, and up to 10⁴³ different protein sequences might have been explored during 4 Gyr of evolution [14]. Though this number seems to be large, it is extremely small as compared to the number of possible protein sequences (10³⁰⁰ theoretical sequences for a medium-sized protein). Thus, we only know a tiny fraction of the total sequence space of viable proteins, and the theoretical sequence space is sparsely populated by the extant proteins.

In the absence of knowledge about the extant sequence space, relationships between known sequences can be measured by a metric based on global sequence identity or by neighborhood relationships in a protein sequence network where sequences form nodes that are connected by edges [15,16]. While pairwise sequence identity can be determined for all protein families, extended protein sequence networks only exist for families with high microdiversity such as TEM β-lactamases, which form a single connected network of more than 260 single point variants. In this network, the number of neighbors of each sequence is not equally distributed, but follows a power law with a scaling exponent of 1.2 [16]. The protein sequence network of TEM β-lactamases contains a few "hubs" such as TEM-1 and TEM-116 [17] and a large number of less connected nodes, with about 10 times less sequences having 10 times more neighbors each. It is tempting to relate the property of being a highly connected node to the property of being an ancestral sequence by intuitively assuming a preferential attachment model of network generation [18]. However, the observed scale-free degree distribution can result from a variety of different mechanisms [19] and might be determined by the actual constraints of the system rather than a unique mechanism [20].

A central constraint in protein evolution is the evolvability of a protein sequence, which includes two elements, robustness to faults and innovability [21]. Innovability seems to be a consequence of active site location [21]. Robustness can be measured by the tolerance of a protein for deleterious effects of mutations and is related to stability and conformational dynamics of a protein [22,23]. Thus, robustness is expected to vary between and inside a protein family, and it is desirable to identify or construct highly evolvable protein family members as promising starting points for directed evolution experiments.

Methods

Datasets of protein sequences

The datasets of the individual protein families were updated by performing BLAST searches against the non-redundant protein database from the NCBI (GenBank)[24]. The sequence datasets were updated for the families of TEM β-lactamases (TEM, 422 sequences), β-hydroxyacid dehydrogenases/ imine reductases (bHAD, 30781 sequences), thiamine diphosphate-dependent decarboxylases (DC, 39290 sequences), ω-transaminases (oTA, 120921 sequences), and short-chain dehydrogenases/ reductases (SDR, 141496 sequences). In case of TEM β-lactamases, the core region from positions 24 to 280 was used only (referring to TEM-1 position numbering).

Sequence alignments and sequence networks

The distances between pairs of protein sequences can be measured either by counting point mutations or by pairwise sequence alignments. The former metric was applied for the densely connected family of TEM β-lactamases, for which a single point mutation forms the minimal distance between two sequences. TEM β-lactamase protein sequences were connected by an edge, if they differed by one point mutation, which was feasible due to the high microdiversity of this protein family.

Pairwise distances between sequences of the remaining protein superfamilies were calculated by combining the heuristic alignment approach of USEARCH, which reduced the number of sequence pairs, with global Needleman-Wunsch sequence alignment [25,26]. USEARCH alignments were performed to identify highly similar neighbor sequences with an identity threshold of 0.5, corresponding to 50% sequence identity without terminal gaps. In the second step, more accurate global sequence identities were derived from pairwise Needleman-Wunsch alignments (implemented in the EMBOSS bioinformatics software suite [27]) with gap opening penalty of 10 and gap extension penalty of 0.5. USEARCH and EMBOSS were run on multiple threads by applying GNU Parallel [28].

The point-mutation network of TEM β-lactamases and the identity-based networks of the remaining protein superfamilies, i.e. the sequence networks calculated by global sequence alignments as described above, were constructed and visualized by Cytoscape (version 3.4.0) using prefuse force directed layout. For the identity-based networks, prefuse force directed layout was applied with respect to the edge weights (i.e. the higher the sequence identity, the closer the sequences are placed).

Degree distribution and fractal network dimension

For identity-based sequence networks, the number of neighboring sequences for a given sequence, i.e. the degree of a network node, was determined by counting the number of sequence pairs having a minimum sequence identity to the respective sequence, such as ≥95% sequence identity and thus less than 5% pairwise distance. For the point mutation network of TEM β-lactamases, the degree of a network node was determined by counting neighboring sequences within the distance of one point mutation to the respective sequence. The degrees were calculated for all sequences of a given sequence network and the number of sequences N having n neighbors was plotted over the degree n.

To derive the fractal network dimension D_f of identity-based sequence networks, the number of sequence pairs p(d) that differed by less than d%, with d% = (100—sequence identity)%, was computed for pairwise sequence identities determined by USEARCH [25]. The respective fractal network dimension D_f was calculated assuming p(d) ~ d^Df and plotting log(p(d)) over log(d) for d = 2, 4, 6, …, 100. In addition, p(d) was determined for the point-mutation network of TEM β-lactamases with d = 1, 2, 3, 4 point mutations.

Results

All members of a protein family are related to each other by their global sequence identity obtained from pairwise sequence alignments. This relationship was analyzed by constructing networks where the nodes represent individual sequences and the edges represent a neighborhood relationship. Two types of neighborhood relationships were applied for network construction. In identity-based networks, an edge is formed between a pair of sequences if their global sequence identity exceeds a threshold. By adjusting the sequence identity threshold, the construction of identity-based sequence networks was feasible for all homologous protein families and resulted in connected networks for each family. In the rare case of protein families with high microdiversity, such as the TEM β-lactamase family, a second network type was constructed, where an edge between two nodes was formed if the two sequences differed by a point mutation. If such a point mutation network is feasible, it is expected to be highly similar to the respective identity-based network with high sequence identity threshold.

Network models for TEM β-lactamases, a family of high microdiversity

The TEM β-lactamase family is a large protein family of high microdiversity [16]. A point mutation network was constructed for variants of the TEM β-lactamase core region, resulting in 267 nodes and 401 edges (Fig 1). The number of neighbors varied widely for each node. While there were two highly connected hubs (TEM-1 with 86 and TEM-116 with 55 neighbor sequences), most nodes had only few neighbors. The network properties were characterized by calculating the degree distribution, and the number N of nodes with n neighbors followed a power law distribution N(n)~n^-γ with a scaling exponent γ = 1.2 (Fig 2).

Download:

Fig 1. Sequence network for 267 TEM β-lactamases formed by 401 point mutations (edges) with 259 sequences forming a densely connected network with two hub sequences (TEM-1 depicted as black rectangle, TEM-116 as black oval).

First neighbors of hub sequences are depicted in dark gray, other sequences in white.

https://doi.org/10.1371/journal.pone.0200815.g001

Download:

Fig 2. Distribution of the number of sequences N in a network of TEM β-lactamase point mutations having n first neighbors (Fig 1).

The degree distribution follows a power law with exponent γ = 1.2.

https://doi.org/10.1371/journal.pone.0200815.g002

For comparison, an identity-based network was constructed using a global sequence identity threshold of 99.5% pairwise sequence identity, corresponding to a distance of one point mutation (S1 Fig). The global sequence identity measures the number of mutations between two sequences, but is independent of the number of known sequences between the two sequences. The n network consisted of 267 nodes and 401 edges, too, and its degree distribution followed a power law with a scaling exponent γ which was identical to the point mutation network (S2 Fig).

Alternatively, the degree distributions of the sequence networks were fitted by a Poisson distribution P(λ) and a Gaussian distribution G(μ,σ). In contrast to the power-law distribution, the Poisson and the Gaussian distribution resulted in noticeably qualitative deviations from the experimental data (S2 Fig).

Degree distributions for protein superfamilies with low microdiversity

Except for the TEM β-lactamases, the microdiversities of the protein families were too low to result in connected point mutation networks. Therefore, four protein superfamilies (β-hydroxyacid dehydrogenases/imine reductases, bHAD; thiamine diphosphate-dependent decarboxylases, DC; ω-transaminases, oTA; short-chain dehydrogenases/reductases, SDR) were analyzed by constructing networks based on pairwise sequence identity (Table 1). The protein families differed in their distributions of pairwise sequence identities, which is expected for superfamilies of different sequence length, fold, and domain arrangement (Fig 3).

Download:

Fig 3. Distributions of pairwise global sequence identity for the protein families from Table 1 as determined by high-scoring sequence pairs in USEARCH (20).

https://doi.org/10.1371/journal.pone.0200815.g003

Download:

Table 1. Overview of the analyzed protein family networks by number of nodes (sequences) and maximal degree (number of neighbors) for a 95% sequence identity threshold, with average sequence length.

https://doi.org/10.1371/journal.pone.0200815.t001

Sequence pairs with a global sequence identity ≥95% were defined as neighbors. For the four identity-based networks, the degree distribution was approximated by a power law distribution N(n)~n^-γ, whereas the distributions deviated from a scale-free behavior for the most highly connected nodes (Fig 4). Thus, data for degrees ≥50 or 70 were excluded from linear regression, resulting in scaling exponents of γ = 1.2–1.3 (Table 2). The power law distribution was maintained upon decreasing the global sequence identity thresholds for the construction of identity-based networks to ≥90%, ≥85%, or ≥80% (S3–S5 Figs), and the scaling exponents γ decreased slightly with decreasing threshold to γ = 0.9–1.1. Furthermore, subsets between 10% and 90% randomly selected sequences from the DC superfamily resulted in similar scaling exponents γ between 1.1 and 1.4 (S1 Table).

Download:

Fig 4. Neighbor distribution for the protein families with low microdiversity from Table 1 with neighbors defined by ≥95% global sequence identity.

The corresponding scale-free exponents γ were derived from linear regression for degrees ≤ 50 (bHAD, DC) or ≤ 70 (oTA, SDR) and are summarized in Table 2.

https://doi.org/10.1371/journal.pone.0200815.g004

Download:

Table 2. Overview of the analyzed protein families from Table 1 and their derived parameters.

https://doi.org/10.1371/journal.pone.0200815.t002

The inhomogeneous power law degree distributions of identity-based sequence networks point to the existence of highly connected hubs in the sequence space of the four protein superfamilies (Table 3). Instead of individual hub sequences, communities of highly connected nodes with similar degrees were identified in the identity-based networks. For the DC superfamily, the 100 most highly connected protein sequences had between 250 and 266 neighboring sequences. Upon random selection of a subset of protein sequences from the DC superfamily, the respective sequences with the highest number of neighboring sequences were found to be highly similar, unless very small subsets were analyzed (S2 Table).

Download:

Table 3. Exemplary network hubs and their annotations from sequence networks with a threshold of 95% sequence identity (99.5% for TEM β-lactamases)^a for the protein families from Table 1.

https://doi.org/10.1371/journal.pone.0200815.t003

Dimensions of protein sequence networks

As a further network property, the fractal network dimension D_f was evaluated by counting the number of sequence pairs p(d) that differed by less than d% (100%—sequence identity) for d = 2, 4, 6, … (Fig 5). For low values of d (d≤10%, i.e. ≥90% identity), log p(d) increased linearly with log d, resulting in a network dimension D_f = 0.7–1.0 for the four superfamilies with low microdiversity (Table 2). Random selection of a subset of protein sequences from the DC superfamily lead to almost identical values for of D_f ≈ 0.7 for d≤10% (S6 Fig). For increasing distance d, the network dimension D_f increased to D_f = 3.5–4.5 for 30%≤d≤70%. For the family of TEM β-lactamases, D_f was estimated to 1.8 from the values at d = 2% and d = 4%. Because of the high sequence identities of the members of the TEM β-lactamase family, only few sequence pairs showed distances higher than 4% identity. Estimating the fractal network dimension for the point-mutation network of TEM β-lactamases by comparing the number of single and double mutants resulted in a higher value of D_f = 4.0 (S7 Fig). Beyond double mutants, the limited network size resulted in an apparent decrease of the network dimension, and the analysis of double, triple, and quadruple mutants resulted in D_f = 1.8, as observed for the identity-based TEM β-lactamase network.

Download:

Fig 5. Cumulative distributions of sequence pairs p(d) for pairwise distances of d% of the protein families TEM (open squares), DC (filled squares), bHAD (open circles), SDR (diamonds) and oTA (filled circles) from Table 1 in subsequent distance intervals of 2% distance d (100%—sequence identity).

Linear fits are shown as red lines for distances up to 10% identity (up to 4% for TEM). For further distances between 70 and 30%, an approximately linear area is depicted in red.

https://doi.org/10.1371/journal.pone.0200815.g005

Discussion

The dimension of protein sequence space

The evolution of protein sequences occurs in iterative steps of random mutagenesis of the genotype and subsequent selection of the phenotype. Therefore, the sequence space that has been iteratively explored during 4 Gyr of evolution is expected to be connected [29]. Since the number of explored protein sequences (10⁴⁰) is much smaller than the number of theoretical sequences (>10³⁰⁰), the dimension of the sub-space of extant protein sequences is expected to be much smaller than the multi-thousand dimensional space of theoretical sequences. An estimation of the dimension of the known sequence space was achieved by counting the numbers of neighbors at increasing distances. The fractal network dimension D_f of a protein family was similar among the investigated protein families. D_f varied between 0.7 and 1.0 for sequence identities between 98% and 90%, whereas D_f increased to values between 3.5 and 4.5 at lower sequence identities between 70% and 30%. The observation of a distance-dependent fractal dimension of sequence space gives an interesting insight into the sequence-function relationships of proteins. For uncorrelated random mutations, it has been estimated that the probability of protein inactivation is 34% for each mutation [30]. Therefore, for a small number of mutations, the chance of finding active mutants is high (0.66² = 44% and 0.66⁴ = 19% for two and four mutations, respectively). Thus, many combinations of random mutations result in active proteins, and D_f ≈ 4.0 as evaluated for the point-mutation network of TEM β-lactamases is a lower limit of the dimension of the extant sequence space for a small number of mutations, because D_f is expected to further increase as more TEM β-lactamase sequences are discovered in the future. In contrast, if 10% of all positions are randomly exchanged, the chance of finding an active variant of a 300 amino acid protein reduces to 0.66³⁰ = 4·10⁻⁶. Therefore, the mutations that result in an active protein must be highly correlated, and evolution is dominated by the non-additive effects of epistasis [12]. The high correlation of mutations is compatible with the much lower fractal network dimension D_f = 0.7–1.0, which seems to be a generic property of all investigated protein families. For lower sequence identities between 70 and 30%, the mutations become more uncoupled, which results in a considerable increase of the fractal network dimension D_f.

At a first glance, scale-dependent network dimensions are counter-intuitive. However, scale-dependent spatial dimensions have also been observed for physical systems such as turbulent interfaces [31] and for the distribution of luminous matter in the universe [32]. Although the analysis of the distance dependence of protein sequence space is based on a relatively small number of known sequences, it provides quantitative estimates which are in agreement with known sequence-function relationships [30]. It will be interesting to see how D_f develops in the future, when many more protein sequences become known.

Evolutionary constraints for protein sequence space

Two complementary neighborhood definitions were applied to construct sequence networks. A network construction based on point mutations allows for an interpretation of alternative evolutionary paths along the network [16]. However, mutation-based networks are restricted to the rare families with high microdiversity such as TEM β-lactamases. In contrast, the metric of global sequence identity can be applied to all protein families. For TEM β-lactamases, the mutation-based and the identity-based degree distributions were identical and were approximated by a power law distribution with a scaling exponent γ = 1.2. A power law degree distribution was also observed for four protein families with low microdiversity (bHAD, DC, oTA, SDR) when using the distance metrics. Although the four families have different structural folds, domain arrangements, and sequence lengths, and differ in their level of sequence diversity (Fig 3) and their size (Table 1), they resulted in similar scaling exponents γ = 1.2–1.4. The observation that different protein families show similar scaling exponents indicates that the constraints that govern protein evolution are similar for all proteins [20].

Scale-free distributions of protein families have been described previously for networks of co-occurring protein domains and networks of sequence motifs, with scaling exponents γ in the range from 1.7 to 2.0 [11,33]. By clustering sequences into homologous families, scale-free cluster size distributions have been observed with scaling exponents between 1.6 and 2.5 [8,10,34,35]. It has been suggested that cluster size distribution is a direct consequence of the necessity for a functional protein to fold into a stable structure [9]. As a consequence, sequence space is highly connected, as seen for families with high microdiversity [16]. Connectivity is also related to findability of genotypes [36]. Stability against random errors, another feature attributed to scale-free networks, is also favorable during evolution [37].

Pitfalls and limitations for protein sequence networks

While scale-free distributions seem to be ubiquitous in many domains of life sciences, care should be taken when drawing far-reaching conclusions which are not supported by the data [19]. Therefore, the goodness of the power law fit was compared to alternative fits by Poisson and Gaussian distributions. While the parameters of the Poisson and Gaussian distributions could be adjusted to follow the data in the tail, they fail to describe the monotonous increase of the number of nodes at decreasing degrees, and thus confirm the power law fit [19,20]. However, the limited number of sequences per protein family and the small fraction (10⁻²⁰) of known protein sequences [14] are two factors that favor the tendency to form a power law distribution, because it has been observed that binning of the data has the tendency to form a power law distribution [38] and that sub-networks tend to exhibit a power law distribution, irrespective of the topological property of the larger network they were sampled from [39]. By analyzing randomly selected sub-networks, we demonstrated that the scaling exponent was robust upon resampling, thus excluding the possibility that the scaling exponent might differ between network and sub-networks [40]. However, there is still a risk that the apparent power law distribution might result from a sampling artefact. As the number of newly sequenced genomes is rapidly expanding in the near future, it will be interesting to see whether the degree distribution is robust upon better sampling of the sequence space.

Implications for protein evolution and protein engineering

Protein networks with a highly inhomogeneous, exponential degree distribution with a long tail have another interesting consequence: the existence of a few highly connected nodes. These hubs are sequences or groups of sequences with a very large number of potentially functional neighbors.

The role of hubs in evolution is still under discussion. It has been suggested that highly connected nodes originated early in evolution [41], while less connected nodes are recent results from divergent evolution [42]. This interpretation of "the old get richer" is based on preferential attachment network models [18]. However, preferential attachment is only one way to generate networks, and there are different network topologies which all result in a power law degree distribution [19]. As a consequence, the most highly connected protein sequences are not necessarily the phylogenetically oldest, thus hub sequences should not be interpreted as ancestors. By assuming that evolution has reached an equilibrium in protein sequence space, the more evolvable folds might have become densely populated as a consequence of convergent evolution [42], thus connecting the concept of hubs to the concept of evolvability. The observation of a uniform distribution of sequences from thermophilic and hyperthermophilic sources in the oTA network demonstrated that hub sequences are not characterized by increased thermostability (S2 Fig in [43]).

Evolvability of a protein sequence has two aspects: robustness toward possible deleterious effects of mutations and innovability, where additional mutations readily induce new functions [21]. Since the hub sequences have many supposedly functional neighbors, they have proven to be highly evolvable. Interestingly, some hub proteins have a pivotal role in metabolism. The E1 subunit of the pyruvate dehydrogenase complex, a hub of the DC network, is also a hub in the metabolism linking glycolysis and citric acid cycle [44,45]. The aspartate aminotransferase, a hub of the oTA network, links the amino acid and the carbohydrate metabolisms [46]. These coincidences of hubs in sequence networks and metabolic networks could point at a higher robustness against mutations to preserve cellular function.

The concept of hubs can also be applied to improve the efficiency of directed evolution experiments. Directed evolution is a powerful and widely applied strategy for improving biochemical and biophysical properties of proteins by applying iterative rounds of random mutations and screening. However, multiple random mutations tend to result in inactive proteins with a probability of 92% for only six random mutations [30]. Therefore, it has been suggested to start a directed evolution experiment either from a population of neutral mutants [47] or by constructing ancestor sequences [48] which are believed to have a higher robustness and thus higher evolvability than contemporary sequences [49]. As a promising alternative, we suggest to use the hub sequences as promising starting points in directed evolution experiments and to select highly evolvable homologues directly from the pool of contemporary sequences.

Supporting information

S1 Table. Scaling exponents γ for randomly selected subnetworks of the DC superfamily, with edges formed by a threshold of 95% pairwise sequence identity.

Linear regressions were performed up to a limited number of neighbors only, due to low sampling quality for higher degrees. Thus, values for γ were determined up to a maximum degree.

https://doi.org/10.1371/journal.pone.0200815.s001

(PDF)

S2 Table. Exemplary protein sequences found in hub regions of the DC networks for varying subsets of randomly selected sequences.

The Annotations are listed as “pyruvate dihydrogen-ase subunit” (PDH), “glyoxylate carboligase” (GLX) or “acetolactase synthase 2 catalytic subunit” (ALS). Pairwise sequence identities towards the hub sequence of the complete net-work (WP_044256366) are given in the column on the right.

https://doi.org/10.1371/journal.pone.0200815.s002

(PDF)

S1 Fig. Sequence network for 267 TEM β-lactamases connected by 401 edges above a 99.5% pairwise sequence identity threshold, in comparison to the point mutation network (Fig 1).

Hub sequences are depicted in black (TEM-1 as black rectangle, TEM-116 as black oval) with their first neighbors depicted in dark gray, other sequences in white.

https://doi.org/10.1371/journal.pone.0200815.s003

(TIF)

S2 Fig. Distribution of the number of sequences N having n first neighbors for the distance-based network of TEM β-lactamases (S1 Fig).

The degree distribution hints at a power law distribution with exponent γ = 1.2 (a). In addition, probability density functions were fitted for a power-law distribution (line, γ = 1.2), a Gaussian distribution (dashed line, μ = 3.0, σ = 6.4) and a Poisson distribution (dotted line, λ = 3.0) with residual sum of squares 0.01, 0.2 and 0.1, respectively (b-d).

https://doi.org/10.1371/journal.pone.0200815.s004

(TIF)

S3 Fig. Degree distribution for the protein families with low microdiversity from Table 1 with neighbors defined by ≥ 90% global sequence identity.

Linear regression was performed for degrees ≤ 50 (bHAD, DC) or ≤ 70 (oTA, SDR).

https://doi.org/10.1371/journal.pone.0200815.s005

(TIF)

S4 Fig. Degree distribution for the protein families with low microdiversity from Table 1 with neighbors defined by ≥ 85% global sequence identity.

Linear regression was performed for degrees ≤ 50 (bHAD, DC) or ≤ 70 (oTA, SDR).

https://doi.org/10.1371/journal.pone.0200815.s006

(TIF)

S5 Fig. Degree distribution for the protein families with low microdiversity from Table 1 with neighbors defined by ≥ 80% global sequence identity.

Linear regression was performed for degrees ≤ 50 (bHAD, DC) or ≤ 70 (oTA, SDR).

https://doi.org/10.1371/journal.pone.0200815.s007

(TIF)

S6 Fig. Cumulative distributions of sequence pairs p(d) for pairwise distances of d % of the DC protein superfamily for the complete data set (open squares) and 10%, 20%, …, 90% randomly selected subsets of sequences (filled squares).

The areas marked in red correspond to the linear approximations from Fig 5.

https://doi.org/10.1371/journal.pone.0200815.s008

(TIF)

S7 Fig. Cumulative distribution of sequence pairs p(d) with distance in d point mutations of TEM β-lactamases (open squares).

Linear fits are shown for d = 1,2 (red line) and for d = 2,3,4 point mutations (blue line).

https://doi.org/10.1371/journal.pone.0200815.s009

(TIF)

Acknowledgments

We thank Jannik Seidel for his explorative studies, Silvia Fademrecht for providing the sequence family databases, and Uta Freiberg for inspiring discussions.

References

1. Newman MEJ. Power laws, Pareto distributions and Zipf’s law. Contemp Phys. 2005;46: 323–351.
- View Article
- Google Scholar
2. West GB, Brown JH. Life’s universal scaling laws. Phys Today. 2004;57: 36–43.
- View Article
- Google Scholar
3. Manhart M, Haldane A, Morozov A V. A universal scaling law determines time reversibility and steady state of substitutions under selection. Theor Popul Biol. 2012;82: 66–76. pmid:22838027
- View Article
- PubMed/NCBI
- Google Scholar
4. Moret MA, Santana MC, Zebende GF, Pascutti PG. Self-similarity and protein compactness. Phys Rev E—Stat Nonlinear, Soft Matter Phys. 2009;80: 1–4. pmid:19905343
- View Article
- PubMed/NCBI
- Google Scholar
5. Reuveni S, Granek R, Klafter J. Proteins: coexistence of stability and flexibility. Phys Rev Lett. 2008;100: 1–4. pmid:18518581
- View Article
- PubMed/NCBI
- Google Scholar
6. Tang Q- Y, Zhang Y- Y, Wang J, Wang W, Chialvo DR. Critical fluctuations in the native state of proteins. Phys Rev Lett. 2017;118: 1–5. pmid:28282168
- View Article
- PubMed/NCBI
- Google Scholar
7. Mora T, Bialek W. Are biological systems poised at criticality? J Stat Phys. 2011;144: 268–302.
- View Article
- Google Scholar
8. Enright AJ, Kunin V, Ouzounis CA. Protein families and TRIBES in genome sequence space. Nucleic Acids Res. 2003;31: 4632–4638. pmid:12888524
- View Article
- PubMed/NCBI
- Google Scholar
9. Deeds EJ, Dokholyan N V., Shakhnovich EI. Protein evolution within a structural space. Biophys J. 2003;85: 2962–2972. pmid:14581198
- View Article
- PubMed/NCBI
- Google Scholar
10. Koonin E V., Wolf YI, Karev GP. The structure of the protein universe and genome evolution. Nature. 2002;420: 218–223. pmid:12432406
- View Article
- PubMed/NCBI
- Google Scholar
11. Wuchty S. Scale-free behavior in protein domain networks. Mol Biol Evol. 2001;18: 1694–1702. pmid:11504849
- View Article
- PubMed/NCBI
- Google Scholar
12. Wu NC, Dai L, Olson CA, Lloyd-Smith JO, Sun R. Adaptation in protein fitness landscapes is facilitated by indirect paths. Elife. 2016;5. pmid:27391790
- View Article
- PubMed/NCBI
- Google Scholar
13. Consortium UniProt. UniProt: a hub for protein information. Nucleic Acids Res. 2014;43: D204–D212. pmid:25348405
- View Article
- PubMed/NCBI
- Google Scholar
14. Dryden DTF, Thomson AR, White JH. How much of protein sequence space has been explored by life on Earth? J R Soc Interface. 2008;5: 953–956. pmid:18426772
- View Article
- PubMed/NCBI
- Google Scholar
15. Widmann M, Pleiss J. Protein variants form a system of networks: Microdiversity of IMP metallo-beta-lactamases. PLoS One. 2014;9. pmid:25013948
- View Article
- PubMed/NCBI
- Google Scholar
16. Zeil C, Widmann M, Fademrecht S, Vogel C, Pleiss J. Network analysis of sequence-function relationships and exploration of sequence space of TEM beta-lactamases. Antimicrob Agents Chemother. 2016;60: 2709–2717. pmid:26883706
- View Article
- PubMed/NCBI
- Google Scholar
17. Jacoby GA, Bush K. The Curious Case of TEM-116. Antimicrob Agents Chemother. 2016;60: 7000–7000. pmid:28045664
- View Article
- PubMed/NCBI
- Google Scholar
18. Barabasi A- L, Albert R. Emergence of scaling in random networks. Science. 1999;286: 509–512. pmid:10521342
- View Article
- PubMed/NCBI
- Google Scholar
19. Lima-Mendez G, van Helden J. The powerful law of the power law and other myths in network biology. Mol Biosyst. 2009;5: 1482–1493. pmid:20023717
- View Article
- PubMed/NCBI
- Google Scholar
20. Keller EF. Revisiting “scale-free” networks. BioEssays. 2005;27: 1060–1068. pmid:16163729
- View Article
- PubMed/NCBI
- Google Scholar
21. Dellus-Gur E, Toth-Petroczy A, Elias M, Tawfik DS. What makes a protein fold amenable to functional innovation? Fold polarity and stability trade-offs. J Mol Biol. Elsevier B.V.; 2013;425: 2609–2621. pmid:23542341
- View Article
- PubMed/NCBI
- Google Scholar
22. Tokuriki N, Tawfik DS. Protein dynamism and evolvability. Science. 2009;324: 203–207. pmid:19359577
- View Article
- PubMed/NCBI
- Google Scholar
23. Dellus-Gur E, Elias M, Caselli E, Prati F, Salverda MLM, de Visser JAGM, et al. Negative epistasis and evolvability in TEM-1 β-lactamase—The thin line between an enzyme’s conformational freedom and disorder. J Mol Biol. Elsevier Ltd; 2015;427: 2396–2409. pmid:26004540
- View Article
- PubMed/NCBI
- Google Scholar
24. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Ostell J, Pruitt KD, et al. GenBank. Nucleic Acids Res. 2017;46: D41–D47. pmid:29140468
- View Article
- PubMed/NCBI
- Google Scholar
25. Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26: 2460–2461. pmid:20709691
- View Article
- PubMed/NCBI
- Google Scholar
26. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. Elsevier; 1970;48: 443–453. pmid:5420325
- View Article
- PubMed/NCBI
- Google Scholar
27. Rice P, Longden I, Bleasby A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000;16: 276–277. pmid:10827456
- View Article
- PubMed/NCBI
- Google Scholar
28. Tange O. GNU parallel: the command-line power tool. login USENIX Mag. 2011;36: 42–47.
- View Article
- Google Scholar
29. Smith JM. Natural selection and the concept of a protein space. Nature. 1970;225: 563–564. pmid:5411867
- View Article
- PubMed/NCBI
- Google Scholar
30. Guo HH, Choe J, Loeb LA. Protein tolerance to random amino acid change. Proc Natl Acad Sci U S A. 2004;101: 9205–9210. pmid:15197260
- View Article
- PubMed/NCBI
- Google Scholar
31. Catrakis HJ, Dimotakis PE. Scale distributions and fractal dimensions in turbulence. Phys Rev Lett. 1996;77: 3795–3798. pmid:10062310
- View Article
- PubMed/NCBI
- Google Scholar
32. Bak P, Chen K. Scale dependent dimension of luminous matter in the universe. Phys Rev Lett. 2001;86: 4215–4218. pmid:11328138
- View Article
- PubMed/NCBI
- Google Scholar
33. Aziz MF, Caetano-Anollés K, Caetano-Anollés G. The early history and emergence of molecular functions and modular scale-free network behavior. Sci Rep. Nature Publishing Group; 2016;6. pmid:27121452
- View Article
- PubMed/NCBI
- Google Scholar
34. Orengo CA, Thornton JM. Protein families and their evolution—a structural perspective. Annu Rev Biochem. 2005;74: 867–900. pmid:15954844
- View Article
- PubMed/NCBI
- Google Scholar
35. Buchholz PCF, Fademrecht S, Pleiss J. Percolation in protein sequence space. PLoS One. Public Library of Science; 2017;12. pmid:29261740
- View Article
- PubMed/NCBI
- Google Scholar
36. McCandlish DM. On the findability of genotypes. Evolution (N Y). 2013;67: 2592–2603. pmid:24033169
- View Article
- PubMed/NCBI
- Google Scholar
37. Albert R, Barabasi A- L. Statistical mechanics of complex networks. Rev Mod Phys. 2002;74: 47–97.
- View Article
- Google Scholar
38. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási A- L. The large-scale organization of metabolic networks. Nature. 2000;407: 651–654. pmid:11034217
- View Article
- PubMed/NCBI
- Google Scholar
39. Han J- DJ, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang L V, et al. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature. 2004;430: 88–93. pmid:15190252
- View Article
- PubMed/NCBI
- Google Scholar
40. Stumpf MPH, Wiuf C, May RM. Subnets of scale-free networks are not scale-free: Sampling properties of networks. Proc Natl Acad Sci. 2005;102: 4221–4224. pmid:15767579
- View Article
- PubMed/NCBI
- Google Scholar
41. Fell DA, Wagner A. The small world of metabolism. Nat Biotechnol. 2000;18: 1121–1122. pmid:11062388
- View Article
- PubMed/NCBI
- Google Scholar
42. Dokholyan N V., Shakhnovich B, Shakhnovich EI. Expanding protein universe and its origin from the biological Big Bang. Proc Natl Acad Sci. 2002;99: 14132–14136. pmid:12384571
- View Article
- PubMed/NCBI
- Google Scholar
43. Buß O, Buchholz PCF, Gräff M, Klausmann P, Rudat J, Pleiss J. The ω-transaminase engineering database (oTAED): a navigation tool in protein sequence and structure space. Proteins Struct Funct Bioinforma. 2018;86: 566–580. pmid:29423963
- View Article
- PubMed/NCBI
- Google Scholar
44. Gray LR, Tompkins SC, Taylor EB. Regulation of pyruvate metabolism and human disease. Cell Mol Life Sci. 2014;71: 2577–2604. pmid:24363178
- View Article
- PubMed/NCBI
- Google Scholar
45. Zhang S, Hulver MW, McMillan RP, Cline MA, Gilbert ER. The pivotal role of pyruvate dehydrogenase kinases in metabolic flexibility. Nutr Metab. 2014;11. pmid:24520982
- View Article
- PubMed/NCBI
- Google Scholar
46. Korla K, Vadlakonda L, Mitra CK. Kinetic simulation of malate-aspartate and citrate-pyruvate shuttles in association with Krebs cycle. J Biomol Struct Dyn. Taylor & Francis; 2015;33: 2390–2403. pmid:25559761
- View Article
- PubMed/NCBI
- Google Scholar
47. Gupta RD, Tawfik DS. Directed enzyme evolution via small and effective neutral drift libraries. Nat Methods. 2008;5: 939–942. pmid:18931667
- View Article
- PubMed/NCBI
- Google Scholar
48. Merkl R, Sterner R. Ancestral protein reconstruction: techniques and applications. Biol Chem. 2016;397: 1–21. pmid:26351909
- View Article
- PubMed/NCBI
- Google Scholar
49. Gaucher EA, Govindarajan S, Ganesh OK. Palaeotemperature trend for Precambrian life inferred from resurrected proteins. Nature. 2008;451: 704–707. pmid:18256669
- View Article
- PubMed/NCBI
- Google Scholar

[ref1] 1. Newman MEJ. Power laws, Pareto distributions and Zipf’s law. Contemp Phys. 2005;46: 323–351.
View Article
Google Scholar

[2] View Article

[3] Google Scholar

[ref2] 2. West GB, Brown JH. Life’s universal scaling laws. Phys Today. 2004;57: 36–43.
View Article
Google Scholar

[5] View Article

[6] Google Scholar

[ref3] 3. Manhart M, Haldane A, Morozov A V. A universal scaling law determines time reversibility and steady state of substitutions under selection. Theor Popul Biol. 2012;82: 66–76. pmid:22838027
View Article
PubMed/NCBI
Google Scholar

[8] View Article

[9] PubMed/NCBI

[10] Google Scholar

[ref4] 4. Moret MA, Santana MC, Zebende GF, Pascutti PG. Self-similarity and protein compactness. Phys Rev E—Stat Nonlinear, Soft Matter Phys. 2009;80: 1–4. pmid:19905343
View Article
PubMed/NCBI
Google Scholar

[12] View Article

[13] PubMed/NCBI

[14] Google Scholar

[ref5] 5. Reuveni S, Granek R, Klafter J. Proteins: coexistence of stability and flexibility. Phys Rev Lett. 2008;100: 1–4. pmid:18518581
View Article
PubMed/NCBI
Google Scholar

[16] View Article

[17] PubMed/NCBI

[18] Google Scholar

[ref6] 6. Tang Q- Y, Zhang Y- Y, Wang J, Wang W, Chialvo DR. Critical fluctuations in the native state of proteins. Phys Rev Lett. 2017;118: 1–5. pmid:28282168
View Article
PubMed/NCBI
Google Scholar

[20] View Article

[21] PubMed/NCBI

[22] Google Scholar

[ref7] 7. Mora T, Bialek W. Are biological systems poised at criticality? J Stat Phys. 2011;144: 268–302.
View Article
Google Scholar

[24] View Article

[25] Google Scholar

[ref8] 8. Enright AJ, Kunin V, Ouzounis CA. Protein families and TRIBES in genome sequence space. Nucleic Acids Res. 2003;31: 4632–4638. pmid:12888524
View Article
PubMed/NCBI
Google Scholar

[27] View Article

[28] PubMed/NCBI

[29] Google Scholar

[ref9] 9. Deeds EJ, Dokholyan N V., Shakhnovich EI. Protein evolution within a structural space. Biophys J. 2003;85: 2962–2972. pmid:14581198
View Article
PubMed/NCBI
Google Scholar

[31] View Article

[32] PubMed/NCBI

[33] Google Scholar

[ref10] 10. Koonin E V., Wolf YI, Karev GP. The structure of the protein universe and genome evolution. Nature. 2002;420: 218–223. pmid:12432406
View Article
PubMed/NCBI
Google Scholar

[35] View Article

[36] PubMed/NCBI

[37] Google Scholar

[ref11] 11. Wuchty S. Scale-free behavior in protein domain networks. Mol Biol Evol. 2001;18: 1694–1702. pmid:11504849
View Article
PubMed/NCBI
Google Scholar

[39] View Article

[40] PubMed/NCBI

[41] Google Scholar

[ref12] 12. Wu NC, Dai L, Olson CA, Lloyd-Smith JO, Sun R. Adaptation in protein fitness landscapes is facilitated by indirect paths. Elife. 2016;5. pmid:27391790
View Article
PubMed/NCBI
Google Scholar

[43] View Article

[44] PubMed/NCBI

[45] Google Scholar

[ref13] 13. Consortium UniProt. UniProt: a hub for protein information. Nucleic Acids Res. 2014;43: D204–D212. pmid:25348405
View Article
PubMed/NCBI
Google Scholar

[47] View Article

[48] PubMed/NCBI

[49] Google Scholar

[ref14] 14. Dryden DTF, Thomson AR, White JH. How much of protein sequence space has been explored by life on Earth? J R Soc Interface. 2008;5: 953–956. pmid:18426772
View Article
PubMed/NCBI
Google Scholar

[51] View Article

[52] PubMed/NCBI

[53] Google Scholar

[ref15] 15. Widmann M, Pleiss J. Protein variants form a system of networks: Microdiversity of IMP metallo-beta-lactamases. PLoS One. 2014;9. pmid:25013948
View Article
PubMed/NCBI
Google Scholar

[55] View Article

[56] PubMed/NCBI

[57] Google Scholar

[ref16] 16. Zeil C, Widmann M, Fademrecht S, Vogel C, Pleiss J. Network analysis of sequence-function relationships and exploration of sequence space of TEM beta-lactamases. Antimicrob Agents Chemother. 2016;60: 2709–2717. pmid:26883706
View Article
PubMed/NCBI
Google Scholar

[59] View Article

[60] PubMed/NCBI

[61] Google Scholar

[ref17] 17. Jacoby GA, Bush K. The Curious Case of TEM-116. Antimicrob Agents Chemother. 2016;60: 7000–7000. pmid:28045664
View Article
PubMed/NCBI
Google Scholar

[63] View Article

[64] PubMed/NCBI

[65] Google Scholar

[ref18] 18. Barabasi A- L, Albert R. Emergence of scaling in random networks. Science. 1999;286: 509–512. pmid:10521342
View Article
PubMed/NCBI
Google Scholar

[67] View Article

[68] PubMed/NCBI

[69] Google Scholar

[ref19] 19. Lima-Mendez G, van Helden J. The powerful law of the power law and other myths in network biology. Mol Biosyst. 2009;5: 1482–1493. pmid:20023717
View Article
PubMed/NCBI
Google Scholar

[71] View Article

[72] PubMed/NCBI

[73] Google Scholar

[ref20] 20. Keller EF. Revisiting “scale-free” networks. BioEssays. 2005;27: 1060–1068. pmid:16163729
View Article
PubMed/NCBI
Google Scholar

[75] View Article

[76] PubMed/NCBI

[77] Google Scholar

[ref21] 21. Dellus-Gur E, Toth-Petroczy A, Elias M, Tawfik DS. What makes a protein fold amenable to functional innovation? Fold polarity and stability trade-offs. J Mol Biol. Elsevier B.V.; 2013;425: 2609–2621. pmid:23542341
View Article
PubMed/NCBI
Google Scholar

[79] View Article

[80] PubMed/NCBI

[81] Google Scholar

[ref22] 22. Tokuriki N, Tawfik DS. Protein dynamism and evolvability. Science. 2009;324: 203–207. pmid:19359577
View Article
PubMed/NCBI
Google Scholar

[83] View Article

[84] PubMed/NCBI

[85] Google Scholar

[ref23] 23. Dellus-Gur E, Elias M, Caselli E, Prati F, Salverda MLM, de Visser JAGM, et al. Negative epistasis and evolvability in TEM-1 β-lactamase—The thin line between an enzyme’s conformational freedom and disorder. J Mol Biol. Elsevier Ltd; 2015;427: 2396–2409. pmid:26004540
View Article
PubMed/NCBI
Google Scholar

[87] View Article

[88] PubMed/NCBI

[89] Google Scholar

[ref24] 24. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Ostell J, Pruitt KD, et al. GenBank. Nucleic Acids Res. 2017;46: D41–D47. pmid:29140468
View Article
PubMed/NCBI
Google Scholar

[91] View Article

[92] PubMed/NCBI

[93] Google Scholar

[ref25] 25. Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26: 2460–2461. pmid:20709691
View Article
PubMed/NCBI
Google Scholar

[95] View Article

[96] PubMed/NCBI

[97] Google Scholar

[ref26] 26. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. Elsevier; 1970;48: 443–453. pmid:5420325
View Article
PubMed/NCBI
Google Scholar

[99] View Article

[100] PubMed/NCBI

[101] Google Scholar

[ref27] 27. Rice P, Longden I, Bleasby A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000;16: 276–277. pmid:10827456
View Article
PubMed/NCBI
Google Scholar

[103] View Article

[104] PubMed/NCBI

[105] Google Scholar

[ref28] 28. Tange O. GNU parallel: the command-line power tool. login USENIX Mag. 2011;36: 42–47.
View Article
Google Scholar

[107] View Article

[108] Google Scholar

[ref29] 29. Smith JM. Natural selection and the concept of a protein space. Nature. 1970;225: 563–564. pmid:5411867
View Article
PubMed/NCBI
Google Scholar

[110] View Article

[111] PubMed/NCBI

[112] Google Scholar

[ref30] 30. Guo HH, Choe J, Loeb LA. Protein tolerance to random amino acid change. Proc Natl Acad Sci U S A. 2004;101: 9205–9210. pmid:15197260
View Article
PubMed/NCBI
Google Scholar

[114] View Article

[115] PubMed/NCBI

[116] Google Scholar

[ref31] 31. Catrakis HJ, Dimotakis PE. Scale distributions and fractal dimensions in turbulence. Phys Rev Lett. 1996;77: 3795–3798. pmid:10062310
View Article
PubMed/NCBI
Google Scholar

[118] View Article

[119] PubMed/NCBI

[120] Google Scholar

[ref32] 32. Bak P, Chen K. Scale dependent dimension of luminous matter in the universe. Phys Rev Lett. 2001;86: 4215–4218. pmid:11328138
View Article
PubMed/NCBI
Google Scholar

[122] View Article

[123] PubMed/NCBI

[124] Google Scholar

[ref33] 33. Aziz MF, Caetano-Anollés K, Caetano-Anollés G. The early history and emergence of molecular functions and modular scale-free network behavior. Sci Rep. Nature Publishing Group; 2016;6. pmid:27121452
View Article
PubMed/NCBI
Google Scholar

[126] View Article

[127] PubMed/NCBI

[128] Google Scholar

[ref34] 34. Orengo CA, Thornton JM. Protein families and their evolution—a structural perspective. Annu Rev Biochem. 2005;74: 867–900. pmid:15954844
View Article
PubMed/NCBI
Google Scholar

[130] View Article

[131] PubMed/NCBI

[132] Google Scholar

[ref35] 35. Buchholz PCF, Fademrecht S, Pleiss J. Percolation in protein sequence space. PLoS One. Public Library of Science; 2017;12. pmid:29261740
View Article
PubMed/NCBI
Google Scholar

[134] View Article

[135] PubMed/NCBI

[136] Google Scholar

[ref36] 36. McCandlish DM. On the findability of genotypes. Evolution (N Y). 2013;67: 2592–2603. pmid:24033169
View Article
PubMed/NCBI
Google Scholar

[138] View Article

[139] PubMed/NCBI

[140] Google Scholar

[ref37] 37. Albert R, Barabasi A- L. Statistical mechanics of complex networks. Rev Mod Phys. 2002;74: 47–97.
View Article
Google Scholar

[142] View Article

[143] Google Scholar

[ref38] 38. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási A- L. The large-scale organization of metabolic networks. Nature. 2000;407: 651–654. pmid:11034217
View Article
PubMed/NCBI
Google Scholar

[145] View Article

[146] PubMed/NCBI

[147] Google Scholar

[ref39] 39. Han J- DJ, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang L V, et al. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature. 2004;430: 88–93. pmid:15190252
View Article
PubMed/NCBI
Google Scholar

[149] View Article

[150] PubMed/NCBI

[151] Google Scholar

[ref40] 40. Stumpf MPH, Wiuf C, May RM. Subnets of scale-free networks are not scale-free: Sampling properties of networks. Proc Natl Acad Sci. 2005;102: 4221–4224. pmid:15767579
View Article
PubMed/NCBI
Google Scholar

[153] View Article

[154] PubMed/NCBI

[155] Google Scholar

[ref41] 41. Fell DA, Wagner A. The small world of metabolism. Nat Biotechnol. 2000;18: 1121–1122. pmid:11062388
View Article
PubMed/NCBI
Google Scholar

[157] View Article

[158] PubMed/NCBI

[159] Google Scholar

[ref42] 42. Dokholyan N V., Shakhnovich B, Shakhnovich EI. Expanding protein universe and its origin from the biological Big Bang. Proc Natl Acad Sci. 2002;99: 14132–14136. pmid:12384571
View Article
PubMed/NCBI
Google Scholar

[161] View Article

[162] PubMed/NCBI

[163] Google Scholar

[ref43] 43. Buß O, Buchholz PCF, Gräff M, Klausmann P, Rudat J, Pleiss J. The ω-transaminase engineering database (oTAED): a navigation tool in protein sequence and structure space. Proteins Struct Funct Bioinforma. 2018;86: 566–580. pmid:29423963
View Article
PubMed/NCBI
Google Scholar

[165] View Article

[166] PubMed/NCBI

[167] Google Scholar

[ref44] 44. Gray LR, Tompkins SC, Taylor EB. Regulation of pyruvate metabolism and human disease. Cell Mol Life Sci. 2014;71: 2577–2604. pmid:24363178
View Article
PubMed/NCBI
Google Scholar

[169] View Article

[170] PubMed/NCBI

[171] Google Scholar

[ref45] 45. Zhang S, Hulver MW, McMillan RP, Cline MA, Gilbert ER. The pivotal role of pyruvate dehydrogenase kinases in metabolic flexibility. Nutr Metab. 2014;11. pmid:24520982
View Article
PubMed/NCBI
Google Scholar

[173] View Article

[174] PubMed/NCBI

[175] Google Scholar

[ref46] 46. Korla K, Vadlakonda L, Mitra CK. Kinetic simulation of malate-aspartate and citrate-pyruvate shuttles in association with Krebs cycle. J Biomol Struct Dyn. Taylor & Francis; 2015;33: 2390–2403. pmid:25559761
View Article
PubMed/NCBI
Google Scholar

[177] View Article

[178] PubMed/NCBI

[179] Google Scholar

[ref47] 47. Gupta RD, Tawfik DS. Directed enzyme evolution via small and effective neutral drift libraries. Nat Methods. 2008;5: 939–942. pmid:18931667
View Article
PubMed/NCBI
Google Scholar

[181] View Article

[182] PubMed/NCBI

[183] Google Scholar

[ref48] 48. Merkl R, Sterner R. Ancestral protein reconstruction: techniques and applications. Biol Chem. 2016;397: 1–21. pmid:26351909
View Article
PubMed/NCBI
Google Scholar

[185] View Article

[186] PubMed/NCBI

[187] Google Scholar

[ref49] 49. Gaucher EA, Govindarajan S, Ganesh OK. Palaeotemperature trend for Precambrian life inferred from resurrected proteins. Nature. 2008;451: 704–707. pmid:18256669
View Article
PubMed/NCBI
Google Scholar

[189] View Article

[190] PubMed/NCBI

[191] Google Scholar

Figures

Abstract

Introduction

Methods

Datasets of protein sequences

Sequence alignments and sequence networks

Degree distribution and fractal network dimension

Results

Network models for TEM β-lactamases, a family of high microdiversity

Degree distributions for protein superfamilies with low microdiversity

Dimensions of protein sequence networks

Discussion

The dimension of protein sequence space

Evolutionary constraints for protein sequence space

Pitfalls and limitations for protein sequence networks

Implications for protein evolution and protein engineering

Supporting information

S1 Table. Scaling exponents γ for randomly selected subnetworks of the DC superfamily, with edges formed by a threshold of 95% pairwise sequence identity.

S2 Table. Exemplary protein sequences found in hub regions of the DC networks for varying subsets of randomly selected sequences.

S1 Fig. Sequence network for 267 TEM β-lactamases connected by 401 edges above a 99.5% pairwise sequence identity threshold, in comparison to the point mutation network (Fig 1).

S2 Fig. Distribution of the number of sequences N having n first neighbors for the distance-based network of TEM β-lactamases (S1 Fig).

S3 Fig. Degree distribution for the protein families with low microdiversity from Table 1 with neighbors defined by ≥ 90% global sequence identity.

S4 Fig. Degree distribution for the protein families with low microdiversity from Table 1 with neighbors defined by ≥ 85% global sequence identity.

S5 Fig. Degree distribution for the protein families with low microdiversity from Table 1 with neighbors defined by ≥ 80% global sequence identity.

S6 Fig. Cumulative distributions of sequence pairs p(d) for pairwise distances of d % of the DC protein superfamily for the complete data set (open squares) and 10%, 20%, …, 90% randomly selected subsets of sequences (filled squares).

S7 Fig. Cumulative distribution of sequence pairs p(d) with distance in d point mutations of TEM β-lactamases (open squares).

Acknowledgments

References