Abstract
The transition state ensemble during the folding process of globular proteins occurs when a sufficient number of intrachain contacts are formed, mainly, but not exclusively, due to hydrophobic interactions. These contacts are related to the folding nucleus, and they contribute to the stability of the native structure, although they may disappear after the energetic barrier of transition states has been passed. A number of structure and sequence analyses, as well as protein engineering studies, have shown that the signature of the folding nucleus is surprisingly present in the native three-dimensional structure, in the form of closed loops, and also in the early folding events. These findings support the idea that the residues of the folding nucleus become buried in the very first folding events, thereby helping the formation of closed loops that act as anchor structures, speed up the process, and overcome the Levinthal paradox. We review here an algorithm intended to simulate, in a discrete space, the early steps of the folding process. It is based on a Monte Carlo simulation in which perturbations, or moves, are randomly applied to residues within a sequence. In contrast with many technically similar approaches, this model does not intend to fold the protein but to calculate the number of non-covalent neighbors of each residue during the early steps of the folding process. Amino acids along the sequence are categorized as most interacting residues (MIRs) or least interacting residues (LIRs). The MIR method can be applied under a variety of circumstances. In the cases tested thus far, MIR has successfully identified the exact residue whose mutation causes a switch in conformation. This is consistent with the idea that MIR identifies residues that are important in the folding process. Most MIR positions correspond to hydrophobic residues; correspondingly, MIRs have zero or very low accessible surface area.
Alongside the review of the MIR method, we present a new postprocessing method called smoothed MIR (SMIR), which refines the original MIR method by exploiting the knowledge of residue hydrophobicity. We review known results and present new ones, focusing on the ability of MIR to predict structural changes and secondary structure, and on the improved precision obtained with the SMIR method.
Acknowledgments
We acknowledge Pierre Tufféry for his help in using the RPBS resources. Mathieu Lonquety and Christophe Legendre contributed to the SPROUTS database where SMIR results are stored. They are all thanked for their help. We also wish to acknowledge our collaborators at ASU: Antonia Papandreou-Suppappola and Anna Malin, who have worked on an alternative MIR method, and Banu Ozkan, for evaluating SPROUTS functionalities and discussing future improvements.
Author contributions: All authors have accepted responsibility for the entire content of the submitted manuscript and approved the submission.
Research funding: This work was partially supported by the National Science Foundation (grants IIS 0431174, IIS 0551444, IIS 0612273, IIS 0738906, IIS 0832551, IIS 0944126, and CNS 0849980) and by an invitation of the Université Pierre et Marie Curie.
Employment or leadership: None declared.
Honorarium: None declared.
Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication. Any opinion, finding, and conclusion or recommendation expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Appendix
Lattice geometry
We model a protein as a chain of evenly spaced Cα atoms placed on a lattice [14, 45]. We define a lattice unit (lu) to be 1.7 Å. Hence, Cα atoms are connected by vectors of the form (2, 1, 0). These vectors are $\sqrt{5}$ lu long, i.e., approximately 3.8 Å.
Our model does not take into account the presence of side chains; therefore, the required separation is modeled with a 3.8 Å minimum distance requirement. On the basis of chain geometry, we limit the angle formed by the Cαs at positions i, i+1, and i+2 in a sequence by requiring the distance between them to be from 4.1 to 7.2 Å (or from approximately 2.4 to 4.2 lu).
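The lattice geometry described above can be illustrated with a minimal Python sketch. It assumes that "of the form (2, 1, 0)" means all coordinate permutations and sign changes of that triple, and that the angle restriction is applied to the distance between the Cαs at positions i and i+2, with the bounds taken literally in Å. The function names are ours and are not taken from the MIR implementation.

```python
import itertools
import math

LATTICE_UNIT_ANGSTROM = 1.7  # one lattice unit (lu), as defined in the text


def bond_vectors():
    """Return all signed permutations of (2, 1, 0) (assumed bond set)."""
    vectors = set()
    for perm in itertools.permutations((2, 1, 0)):
        for signs in itertools.product((1, -1), repeat=3):
            vectors.add(tuple(s * c for s, c in zip(signs, perm)))
    return sorted(vectors)


def length_angstrom(vector):
    """Euclidean length of a lattice vector, converted to angstroms."""
    return math.sqrt(sum(c * c for c in vector)) * LATTICE_UNIT_ANGSTROM


def angle_is_allowed(v1, v2, low=4.1, high=7.2):
    """Check the i to i+2 distance implied by two consecutive bond vectors.

    Assumption: the 4.1 to 7.2 angstrom window applies to this distance.
    """
    end_to_end = tuple(a + b for a, b in zip(v1, v2))
    return low <= length_angstrom(end_to_end) <= high


if __name__ == "__main__":
    print(len(bond_vectors()), "bond vectors")
    print(round(length_angstrom((2, 1, 0)), 2), "angstroms per bond")
```

Under these assumptions, two identical consecutive bonds (a fully extended chain segment) give an i to i+2 distance of about 7.6 Å and are rejected, as is an exact fold-back.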
To initiate the simulations, 100 different starting conformations (or models) within the lattice are used. Figure 12 displays a sample of these models as a comprehensive plot. These starting conformations were taken from offline computations for chains of 1100 residues. The only requirement is that these randomly computed seed conformations have some level of non-compactness [14, 45]. Starting from the first backbone residue located at position (0, 0, 0), the first n positions in the seed model are used for an input model with n residues in its sequence. When an input protein is placed into the lattice for the first time, each residue in the protein is positioned according to the corresponding residue in the seed model, i.e., the first residue of the given protein is assigned the position of the first residue in the seed model, the second residue of the given protein is assigned the position of the second residue in the seed model, and so on. If the input sequence is shorter than the seed model, then the positions of the final residues in the seed are not used.
As will be discussed later, a number of possible perturbations may be applied to each of the residues in the model. The model is stored as a set of relative vectors between Cαs, each representing the displacement of a Cα from the previous one. This is useful because we may, before knowing that a move is valid, change the position of a single residue without having to update every other residue that follows it in the chain. For the purposes of energy and neighbor calculations, the absolute position of each residue must be calculated. We create new conformations by working along the relative positions of residues and accumulating their positions in space, in a neighbor-to-neighbor fashion. Hence, every residue following the one being changed is translated by the displacement between the initial and final positions of the perturbed residue. The resulting models are verified by checking that no non-adjacent residues have come closer than the 3.8 Å minimum separation.
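The relative-vector storage described above can be sketched as follows. This is an illustrative Python fragment, not the authors' code: the function names are hypothetical, and the validity test assumes that "close proximity" means closer than the 3.8 Å minimum distance stated earlier.

```python
import math

LATTICE_UNIT_ANGSTROM = 1.7    # lattice unit (lu) defined in the text
MIN_SEPARATION_ANGSTROM = 3.8  # minimum distance requirement from the text


def absolute_positions(relative_vectors, origin=(0, 0, 0)):
    """Accumulate relative bond vectors into absolute lattice positions."""
    positions = [origin]
    for vector in relative_vectors:
        previous = positions[-1]
        positions.append(tuple(p + d for p, d in zip(previous, vector)))
    return positions


def is_valid(positions):
    """Return True if no two non-adjacent residues are closer than 3.8 angstroms."""
    for i in range(len(positions)):
        for j in range(i + 2, len(positions)):
            distance = math.dist(positions[i], positions[j]) * LATTICE_UNIT_ANGSTROM
            if distance < MIN_SEPARATION_ANGSTROM:
                return False
    return True


# Changing a single relative vector translates every later residue
# by the same displacement, without editing the other stored vectors.
chain = [(2, 1, 0), (0, 1, 2), (1, 0, 2)]
perturbed = [(1, 2, 0)] + chain[1:]
before = absolute_positions(chain)
after = absolute_positions(perturbed)
```

In this toy chain, every residue after the origin is shifted by (-1, 1, 0), which is the difference between the new and old first bond vectors.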
At the end of each simulation, the algorithm internally produces a model that has collected all of the perturbations that were accepted, based on an energy criterion described later. It then uses the topology of the model to determine the number of non-covalent neighbors (NCN) of each residue. Two of these resulting models are shown in Figure 13 for the PDB code 1asu. Both of the MIR models follow the rough pattern of the starting model; however, one can observe the same globular regions starting to form at the same places (e.g., at the chain ends). In previous works, these were called proto fragments [5].
Energy model
Although the protein chain is only modeled as a sequence of Cαs, the effect of side chains is included in energy terms associated with each pair of residue interactions. We assume that inter-residue energies are significant when the distance between them is between 3.8 and 5.88 Å (approximately 2.2 to 3.5 lu).
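We take ER(i) to be the energy at the ith residue in a sequence and calculate it as the sum of its pairwise interaction energies with the other residues j, $E_R(i) = \sum_{j \neq i} E_I(i, j)$,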
where the energy interaction EI is calculated according to the distance between residues (i, j) and the energy matrix PE corresponding to the residue-residue energy interaction [4, 14] that is a function of the type of residue (one of the 20 amino acids). We use type(i) to denote a function from residue index to residue type. Let dist(i,j) compute the Euclidean distance between two points on the lattice.
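A per-residue energy of this kind might be computed as in the following Python sketch. Several points are assumptions on our part: the interaction energy is taken to be the matrix entry PE[type(i), type(j)] when the distance lies in the 3.8 to 5.88 Å window and zero otherwise, covalently bonded chain neighbors are skipped, and the window bounds are applied literally in Å. The energy values in `TOY_PE` are invented for illustration and are not the published residue-residue matrix.

```python
import math

LATTICE_UNIT_ANGSTROM = 1.7   # lattice unit (lu) defined in the text
CONTACT_MIN_ANGSTROM = 3.8    # lower bound of the interaction window
CONTACT_MAX_ANGSTROM = 5.88   # upper bound of the interaction window

# Toy pairwise energies for illustration only; NOT the published matrix.
TOY_PE = {("A", "A"): -0.1, ("A", "L"): -0.5, ("L", "L"): -1.0}


def pair_energy(pe, type_a, type_b):
    """Symmetric lookup in the residue-residue energy matrix."""
    return pe[tuple(sorted((type_a, type_b)))]


def interaction_energy(i, j, positions, types, pe):
    """EI(i, j): matrix entry if the pair is within the window, else zero."""
    distance = math.dist(positions[i], positions[j]) * LATTICE_UNIT_ANGSTROM
    if CONTACT_MIN_ANGSTROM <= distance <= CONTACT_MAX_ANGSTROM:
        return pair_energy(pe, types[i], types[j])
    return 0.0


def residue_energy(i, positions, types, pe, min_sequence_gap=2):
    """ER(i): sum of EI(i, j) over residues at least min_sequence_gap away.

    Skipping chain neighbors (gap of 1) is an assumption, not from the text.
    """
    return sum(
        interaction_energy(i, j, positions, types, pe)
        for j in range(len(positions))
        if abs(i - j) >= min_sequence_gap
    )


positions = [(0, 0, 0), (2, 1, 0), (2, -1, 1)]
types = ["L", "A", "L"]
energies = [residue_energy(i, positions, types, TOY_PE) for i in range(3)]
```

Here residues 0 and 2 are about 4.2 Å apart, so each receives the toy L-L energy, whereas residue 1 has only chain neighbors and contributes nothing.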
Monte Carlo algorithm
The core of the MIR algorithm is a Monte Carlo simulation in which possible perturbations (moves) are repeatedly applied to an existing conformation and accepted on the basis of the standard Metropolis criterion. In the following sections, we detail the concrete MIR implementation. The algorithm is implemented in Fortran 90 and is used in a Linux environment. All random numbers are computed with the "Keep It Simple, Stupid" (KISS) method [47]. Each time the algorithm is run, the same initial seed value is used; together with the precomputed random initial conformation, this enables reproducible results. The Monte Carlo simulation is used to generate a total of 100 models.
The limiting number of Monte Carlo steps for each simulation is given by:
where L is the sequence length. During a simulation, we record the state (snapshot) of residue interactions every
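A new conformation is accepted if ΔE<0; otherwise, it may be accepted with the standard Metropolis probability $\exp(-\Delta E / kT)$, where T is the simulation temperature and k is the Boltzmann constant.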
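For reference, the standard Metropolis acceptance rule can be written in a few lines of Python. This sketch assumes the usual form exp(-ΔE/kT) with the temperature expressed in energy units (kT); the function name and the use of Python's `random` module are ours, whereas the MIR implementation uses the KISS generator.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng=random):
    """Accept downhill moves always, uphill moves with exp(-dE / kT).

    `temperature` is kT in the same units as `delta_e` (an assumption).
    """
    if delta_e < 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


if __name__ == "__main__":
    rng = random.Random(0)
    trials = 10000
    accepted = sum(metropolis_accept(1.0, 1.0, rng) for _ in range(trials))
    print("uphill acceptance rate:", accepted / trials)
```

With ΔE equal to kT, the long-run acceptance rate of uphill moves approaches exp(-1), i.e., roughly 0.37.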
Residues with an NCN of at least 6 are marked as MIRs; residues with an NCN of no more than 2 are marked as LIRs. Statistical fluctuations can cause successive positions to be labeled as MIRs without physical meaning; to reduce this effect, a smoothing procedure is implemented on the web server of SPROUTS. On the basis of a Pascal algorithm, it produces a smoothed distribution of NCNs, and the maxima are then considered as SMIRs.
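The thresholding rule stated above is simple enough to express directly. The following Python sketch only illustrates the MIR and LIR cutoffs; the function name and the label strings are hypothetical, and the smoothing step used for SMIR is deliberately not reproduced because its details are not given here.

```python
def classify_residues(ncn_values, mir_min=6, lir_max=2):
    """Label each residue from its NCN: MIR if >= 6, LIR if <= 2, else '-'."""
    labels = []
    for ncn in ncn_values:
        if ncn >= mir_min:
            labels.append("MIR")
        elif ncn <= lir_max:
            labels.append("LIR")
        else:
            labels.append("-")
    return labels


if __name__ == "__main__":
    example_ncn = [0, 2, 3, 6, 7.5]  # made-up values for illustration
    print(classify_residues(example_ncn))
```

Residues with intermediate NCN values receive neither label.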
Simulation step
A model keeps evolving as long as MClimit has not been reached and perturbations that change the energy between some residues can still be selected. When we begin processing the model, we randomly select an unseen residue in the sequence to perturb; call it residue i. We calculate the angle between residues i–1, i, and i+1, and limit it by restricting the distance between these residues to between 4.1 and 7.2 Å (or from approximately 2.4 to 4.2 lu).
For a simple example of a perturbation, consider those perturbations of length
For distances of
If the new model is valid, the change is applied and the energy value is calculated for the residue that was moved. This local energy result is passed to the energy acceptance function to probabilistically determine if this new model should be accepted. If the model is invalid or is not accepted, then we reset to the previous model. In this case, one begins the process of seeking a residue to perturb again.
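The accept-or-reset logic of a single step can be summarized in the following Python sketch. It is a schematic under stated assumptions rather than the Fortran implementation: the move simply replaces one relative vector with a randomly chosen candidate (the actual perturbation set is not reproduced here), the bookkeeping of already-seen residues is omitted, and `is_valid`, `local_energy`, and `accept` are hypothetical callables supplied by the caller.

```python
import random


def monte_carlo_step(relative_vectors, moves, is_valid, local_energy, accept, rng):
    """Attempt one perturbation; return (model, accepted_flag).

    The input list is never modified, so rejecting a move amounts to
    returning the previous model unchanged.
    """
    i = rng.randrange(len(relative_vectors))  # residue to perturb
    candidate = list(relative_vectors)
    candidate[i] = rng.choice(moves)          # hypothetical move set
    if not is_valid(candidate):
        return relative_vectors, False        # invalid geometry: reset
    # Local energy change for the moved residue only, as in the text.
    delta_e = local_energy(candidate, i) - local_energy(relative_vectors, i)
    if accept(delta_e):
        return candidate, True
    return relative_vectors, False            # rejected: reset


if __name__ == "__main__":
    rng = random.Random(1)
    model = [(2, 1, 0), (0, 1, 2), (1, 0, 2)]
    moves = [(2, 1, 0), (0, 2, 1), (1, 0, 2)]
    model, accepted = monte_carlo_step(
        model, moves,
        is_valid=lambda vectors: True,        # stand-in validity check
        local_energy=lambda vectors, i: 0.0,  # stand-in energy function
        accept=lambda delta_e: delta_e <= 0,  # stand-in acceptance rule
        rng=rng,
    )
    print(accepted, model)
```

Passing the checks as callables keeps the step itself independent of the particular validity test, energy function, and acceptance rule that are plugged in.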
References
1. Anfinsen CB. Principles that govern the folding of protein chains. Science 1973;181:223–30. doi:10.1126/science.181.4096.223
2. Anfinsen CB, Haber E. Studies on the reduction and re-formation of protein disulfide bonds. J Biol Chem 1961;236:1361–3. doi:10.1016/S0021-9258(18)64177-8
3. Go N, Taketomi H. Respective roles of short-and long-range interactions in protein folding. Proc Natl Acad Sci USA 1978;75:559–63. doi:10.1073/pnas.75.2.559
4. Miyazawa S, Jernigan RL. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol 1996;256:623–44. doi:10.1006/jmbi.1996.0114
5. Papandreou N, Kanehisa M, Chomilier J. Folding of the human protein FKBP. Lattice Monte-Carlo simulations. C R Acad Sci III 1998;321:835–43. doi:10.1016/S0764-4469(99)80023-7
6. Skolnick J, Kolinski A. Dynamic Monte Carlo simulations of a new lattice model of globular protein folding, structure and dynamics. J Mol Biol 1991;221:499–531. doi:10.1016/0022-2836(91)80070-B
7. Skolnick J, Kolinski A, Ortiz AR. Reduced protein models and their application to the protein folding problem. J Biomol Struct Dyn 1998;16:381–96. doi:10.1080/07391102.1998.10508255
8. Kolinski A, Rotkiewicz P, Skolnick J. Application of a high coordination lattice model in protein structure prediction. In: Proceedings of the Workshop on Monte Carlo approach to biopolymers and protein folding, Singapore: World Scientific, 1998:377–88.
9. Chomilier J, Lamarine M, Mornon JP, Torres JH, Eliopoulos E, Papandreou N. Analysis of fragments induced by simulated lattice protein folding. C R Biol 2004;327:431–43. doi:10.1016/j.crvi.2004.02.002
10. Prudhomme N, Chomilier J. Prediction of the protein folding core: application to the immunoglobulin fold. Biochimie 2009;91:1465–74. doi:10.1016/j.biochi.2009.07.016
11. Lonquety M, Lacroix Z, Chomilier J. Evaluation of the stability of folding nucleus upon mutation. Pattern Recogn Bioinform LNCS 2008;5265:54–65. doi:10.1007/978-3-540-88436-1_5
12. Lonquety M, Lacroix Z, Papandreou N, Chomilier J. SPROUTS: a database for the evaluation of protein stability upon point mutation. Nucleic Acids Res 2009;37:D374–9. doi:10.1093/nar/gkn704
13. Alland C, Moreews F, Boens D, Carpentier M, Chiusa S, Lonquety M, et al. RPBS: a web resource for structural bioinformatics. Nucleic Acids Res 2005;33:W44–9. doi:10.1093/nar/gki477
14. Callebaut I, Labesse G, Durand P, Poupon A, Canard L, Chomilier J, et al. Deciphering protein sequence information through hydrophobic cluster analysis (HCA): current status and perspectives. Cell Mol Life Sci 1997;53:621–45. doi:10.1007/s000180050082
15. Acuña R, Lacroix Z, Chomilier J, Papandreou N. SMIR: a method to predict the residues involved in the core of a protein. In: Proceedings of the International Conference on Bioinformatics & Computational Biology (BIOCOMP). Las Vegas, USA, 2014.
16. Lipman DJ, Pearson WR. Rapid and sensitive protein similarity searches. Science 1985;227:1435–41. doi:10.1126/science.2983426. PMID 2983426
17. Bostock M, Ogievetsky V, Heer J. D3: data-driven documents. IEEE Trans Vis Comput Graph 2011;17:2301–9. doi:10.1109/TVCG.2011.185
18. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995;247:536–40. doi:10.1016/S0022-2836(05)80134-2
19. Hamill S, Steward A, Clarke J. The folding of an immunoglobulin like Greek key protein is defined by a common core nucleus and regions constrained by topology. J Mol Biol 2000;297:165–78. doi:10.1006/jmbi.2000.3517
20. Berezovsky IN, Grosberg AY, Trifonov EN. Closed loops of nearly standard size: common basic element of protein structure. FEBS Lett 2000;466:283–6. doi:10.1016/S0014-5793(00)01091-7
21. Lamarine M, Mornon JP, Berezovsky IN, Chomilier J. Distribution of tightened end fragments of globular proteins statistically match that of topohydrophobic positions: towards an efficient punctuation of protein folding? Cell Mol Life Sci 2001;58:492–8. doi:10.1007/PL00000873
22. Chintapalli SV, Illingworth CJ, Upton GJ, Sacquin-Mora S, Reeves PJ, Mohammedali HS, et al. Assessing the effect of dynamics on the closed-loop protein-folding hypothesis. J R Soc Interface 2013;11:20130935. doi:10.1098/rsif.2013.0935
23. Potapov V, Cohen M, Schreiber G. Assessing computational methods for predicting protein stability upon mutation: good on average but not in the details. Protein Eng Des Sel 2009;22:553–60. doi:10.1093/protein/gzp030
24. Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. The FoldX web server: an online force field. Nucleic Acids Res 2005;33:W382–8. doi:10.1093/nar/gki387
25. Cheng J, Randall A, Baldi P. Prediction of protein stability changes for single-site mutations using support vector machines. Proteins 2006;62:1125–32. doi:10.1002/prot.20810
26. Zhou H, Zhou Y. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci 2002;11:2714–26. doi:10.1110/ps.0217002
27. Capriotti E, Fariselli P, Casadio R. I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res 2005;33:W306–10. doi:10.1093/nar/gki375
28. He Y, Yeh DC, Alexander P, Bryan PN, Orban J. Solution NMR structures of IgG binding domains with artificially evolved high levels of sequence identity but different folds. Biochemistry 2005;44:14055–61. doi:10.1021/bi051232j
29. Alexander PA, He Y, Chen Y, Orban J, Bryan PN. The design and characterization of two proteins with 88% sequence identity but different structure and function. Proc Natl Acad Sci USA 2007;104:11963–8. doi:10.1073/pnas.0700922104
30. Alexander PA, He Y, Chen Y, Orban J, Bryan PN. A minimal sequence code for switching protein structure and function. Proc Natl Acad Sci USA 2009;106:21149–54. doi:10.1073/pnas.0906408106
31. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983;22:2577–637. doi:10.1002/bip.360221211
32. Tsodikov OV, Record MT, Sergeev YV. A novel computer program for fast exact calculation of accessible and molecular surface areas and average surface curvature. J Comput Chem 2002;23:600–9. doi:10.1002/jcc.10061
33. Faísca PF, Travasso RD, Parisi A, Rey A. Why do protein folding rates correlate with metrics of native topology? PLoS ONE 2012;7:e35599. doi:10.1371/journal.pone.0035599
34. Travasso RD, Faísca PF, Rey A. The protein folding transition state: insight from kinetics and thermodynamics. J Chem Phys 2010;133:125102. doi:10.1063/1.3485286
35. Lappalainen I, Hurley M, Clarke J. Plasticity within the obligatory folding nucleus of an immunoglobulin like domain. J Mol Biol 2008;375:547–59. doi:10.1016/j.jmb.2007.09.088
36. Galzitskaya OV, Ivankov DN, Finkelstein AV. Folding nuclei in proteins. FEBS Lett 2001;489:113–8. doi:10.1016/S0014-5793(01)02092-0
37. Galzitskaya OV, Skoogarev AV, Ivankov DN, Finkelstein AV. Folding nuclei in 3D protein structures. Pac Symp Biocomput 2000;5:131–42.
38. Lonquety M, Chomilier J, Papandreou N, Lacroix Z. Prediction of stability upon mutation in the context of the folding nucleus. OMICS 2010;14:151–6. doi:10.1089/omi.2009.0022
39. Acuña R, Lacroix Z, Chomilier J. A workflow for the prediction of the effects of residue substitution on protein stability. Pattern Recogn Bioinform LNCS 2013;7986:253–64. doi:10.1007/978-3-642-39159-0_23
40. Néron B, Ménager H, Maufrais C, Joly N, Maupetit J, Letart S, et al. Mobyle: a new full web bioinformatics framework. Bioinformatics 2009;25:3005–11. doi:10.1093/bioinformatics/btp493
41. Strauser E, Naveau M, Ménager H, Maupetit J, Lacroix Z, Tufféry P. Semantic map for structural bioinformatics: enhanced service discovery based on high level concept ontology. Resour Discov LNCS 2010;6799:57–70. doi:10.1007/978-3-642-27392-6_5
42. Lacroix Z, Critchlow T, editors. Bioinformatics: managing scientific data. San Francisco: Morgan Kaufmann, 2003.
43. Kolinski A, Rotkiewicz P, Ilkowski B, Skolnick J. Protein folding: flexible lattice models. Prog Theor Phys Suppl 2000;138:292–300. doi:10.1143/PTPS.138.292
44. Ravichandran L, Papandreou-Suppappola A, Spanias A, Lacroix Z, Legendre C. Waveform mapping and time-frequency processing of DNA and protein sequences. IEEE Trans Signal Process 2011;59:4210–24. doi:10.1109/TSP.2011.2157915
45. Papandreou N, Berezovsky IN, Lopes A, Eliopoulos E, Chomilier J. Universal positions in globular proteins – from observation to simulation. Eur J Biochem 2004;271:4762–8. doi:10.1111/j.1432-1033.2004.04440.x
46. Miyazawa S, Jernigan RL. Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 1985;18:534–52. doi:10.1021/ma00145a039
47. Marsaglia G, Zaman A. The KISS generator. Technical report, Department of Statistics, Florida State University, 1993.
©2014 by De Gruyter