Abstract
We discuss the relationship between protein tertiary structure prediction from the amino acid sequence and uncertainty analysis. The algorithm presented in this paper belongs to the category of decoy-based modeling, in which different known protein models are used to establish a low-dimensional space via principal component analysis. An energy optimization is then performed in this low-dimensional space via a family of very explorative particle swarm optimizers in order to find the global minimum. The aim of this procedure is to obtain a representative sample of the nonlinear equivalent region, that is, of the protein models whose energy is lower than a certain energy bound. The posterior analysis of this family of models provides valuable information about the backbone structure of the native conformation and its possible alternate states. This methodology has the advantage of being simple and fast, and it can help refine the tertiary protein structure. We comprehensively illustrate the performance of our algorithm on one protein from the CASP-9 protein structure prediction experiment. We also provide a theoretical analysis of the energy landscape found in the protein tertiary structure inverse problem, explaining why model reduction techniques (principal component analysis in this case) serve to alleviate the ill-posed character of this high-dimensional optimization problem. In addition, we expand the computational benchmark with a summary of other CASP-9 proteins in the Appendix.
References
Tyka MD et al. (2011) Alternate states of proteins revealed by detailed energy landscape mapping. J Mol Biol 405:607–618
Zhang Y (2008) Progress and challenges in protein structure prediction. Curr Opin Struct Biol 18:342–348
Stoker HS (2015) Organic and biological chemistry. Cengage Learning, Boston
Sander C, Schneider R (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9:56–68
Bowie JU et al. (1991) A method to identify protein sequences that fold into a known three-dimensional structure. Science 253:164–170
Alvarez-Machancoses O et al. (2018) Principal component analysis in protein tertiary structure. J Bioinf Comp Biol 16:1850005
Saraswathi S, Fernández-Martínez JL et al. (2012) Fast learning optimized prediction methodology (FLOPRED) for protein secondary structure prediction. J Mol Model 18:4275–4289
Saraswathi S, Fernández-Martínez JL et al. (2013) An aminoacid perspective to secondary structure prediction. J Mol Model 19:4337–4348
Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science 294:93–96
Ramelot TA et al. (2009) Improving NMR protein structure quality by Rosetta refinement: a molecular replacement study. Proteins 75:147–167
Gniewek P et al. (2014) BioShell - threading: a versatile Monte Carlo package for protein threading. BMC Bioinform 22:22
Gniewek P et al. (2012) How noise in force fields can affect the structural refinement of protein models. Proteins: Struct Funct Bioinf 80:335–341
Gront D, Kolinski A (2006) Bioshell - A package of tools for structural biology prediction. Bioinformatics 22:621–622
Gront D, Kolinski A (2008) Utility library for structural bioinformatics. Bioinformatics 24:584–585
Yang Y, Zhou Y (2008) Specific interactions for ab initio folding of protein terminal regions with secondary structures. Proteins 72:793–803
Qiu D et al. (1997) The GB/SA Continuum Model for Solvation. A Fast Analytical Method for the Calculation of Approximate Born Radii. J Phys Chem A 101:3005–3014
Price SL (2008) From crystal structure prediction to polymorph prediction: interpreting the crystal energy landscape. Phys Chem Chem Phys 2008:1996–2009
Goldenberg DP, Creighton TE (2004) Energetics of protein structure and folding. Biopolymers 24:167–182
Fernández-Martínez JL, García-Gonzalo E (2011) Stochastic stability analysis of the linear continuous and discrete PSO models. IEEE Trans Evol Comput 15:405–423
Fernández-Martínez JL, García-Gonzalo E (2012) Stochastic stability and numerical analysis of two novel algorithms of the PSO family: PP-PSO and RR-PSO. Int J Artif Intell Tools 21:1240011
Fernández-Martínez JL et al. (2013) From Bayes to Tarantola: New insights to understand uncertainty in inverse problems. J Appl Geophys 98:62–72
Fernández-Martínez JL et al. (2012) On the topography of the cost functional in linear and nonlinear inverse problems. Geophysics 77:W1–W15
Fernández-Martínez JL et al. (2014) The effect of the noise and Tikhonov’s regularization in inverse problems. Part I: the linear case. J Appl Geophys 108:176–185
Fernández-Martínez JL (2014) The effect of the noise and Tikhonov’s regularization in inverse problems. Part II: the nonlinear case. J Appl Geophys 108:186–193
Zhang Y, Skolnick J (2004) SPICKER: a clustering approach to identify near-native protein folds. J Comp Chem 25:865–871
Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philos Mag 2:559–572
Fernández-Martínez JL et al. (2012) Reservoir characterization and inversion uncertainty via a family of particle swarm optimizers. Geophysics 77–1:1–16
Jolliffe I (2002) Principal component analysis. Springer, New York
Qian B et al. (2004) Improvement of comparative model accuracy by free-energy optimization along principal components of natural structural variation. Proc Natl Acad Sci USA 101:15346–15351
Tarantola A (2005) Inverse problem theory and methods for model parameter estimation. SIAM, Philadelphia
Fernández-Martínez JL (2015) Model reduction and uncertainty analysis in inverse problems. Leading Edge 34:1006–1016
Kennedy J, Eberhart R (1995) A new optimizer using particle swarm theory. Proc Sixth Int Symp Micro Mach Human Sci
Fernández-Martínez JL, García-Gonzalo E (2008) The generalized PSO: a new door to PSO evolution. J Artif Evol Appl: 861275
Fernández-Martínez JL, García-Gonzalo E (2009) The PSO family: deduction, stochastic analysis and comparison. Swarm Intell 3:245–273
Aramini JM et al. (2010) Solution NMR structure of a putative uracil DNA glycosylase from Methanosarcina acetivorans. Northeast structural genomics consortium target MvR76
Fernández-Martínez JL et al. (2012) Stochastic stability and numerical analysis of two novel algorithms of the PSO family: PP-PSO and RR-PSO. Int J Artif Intell Tools 21:1240011
Fernández-Martínez JL, García-Gonzalo E (2011) Stochastic stability analysis of the linear continuous and discrete PSO models. IEEE Trans Evol Comput 15:405–423
Acknowledgments
A. K. acknowledges financial support from NSF grant DBI 1661391, NIH grants R01 GM127701 and R01 GM127701-01S1, and from Bridge funding from The Research Institute at Nationwide Children’s Hospital. We also acknowledge Ms. Celia Fernández-Brillet for her help in revising this paper.
Appendix. Supporting Information
In this section, we aim to expand the paper benchmark by presenting results for an additional set of nine proteins from CASP9. We tested our methodology in order to prove its suitability for protein refinement purposes.
T0551 – X-ray crystal structure of protein SP_0782 (7-79) from Streptococcus pneumoniae. Northeast Structural Genomics Consortium Target SpR104
We present the numerical results of applying PSO to obtain the tertiary structure of protein SP_0782 (7-79) from Streptococcus pneumoniae, whose native structure was determined through X-ray diffraction by Kuzin et al. [36].
Because optimizing the T0551 structure proved difficult, a swarm of 70 particles was used, and the tenth percentile of the best templates was chosen. These choices ensure good convergence and protein refinement while allowing a wide sampling over a search space constructed from good a priori models.
As observed in Fig. 11, the energy converges within the first 20 iterations and reaches a value of –162.3. Because the majority of the models fluctuate around this energy, the resulting protein has low uncertainty, as shown in Fig. 12.
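The template selection and search-space construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it assumes that each decoy has already been flattened into a coordinate vector of equal length and scored with an energy, and it reads "the tenth percentile of the best templates" as the 10% of decoys with the lowest energy. The names `build_reduced_space` and `to_model`, the array layout, and the number of retained components are hypothetical.

```python
import numpy as np

def build_reduced_space(decoys, energies, percentile=10.0, n_components=3):
    """Keep the lowest-energy decoys and build a PCA basis from them.

    decoys   : (n_decoys, n_coords) array, one flattened model per row (assumed layout)
    energies : (n_decoys,) array of decoy energies (lower is better)
    """
    decoys = np.asarray(decoys, dtype=float)
    energies = np.asarray(energies, dtype=float)

    # Select the best `percentile` percent of the templates by energy.
    cutoff = np.percentile(energies, percentile)
    best = decoys[energies <= cutoff]

    # Centre the selected models and obtain the principal directions via SVD.
    mean = best.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(best - mean, full_matrices=False)
    k = min(n_components, vt.shape[0])
    return mean, vt[:k], singular_values[:k]

def to_model(mean, basis, reduced_coords):
    """Map reduced (PCA) coordinates back to a full coordinate vector."""
    return mean + np.asarray(reduced_coords, dtype=float) @ basis

# Toy usage with synthetic data (not CASP9 decoys).
rng = np.random.default_rng(0)
toy_decoys = rng.normal(size=(200, 30))
toy_energies = rng.normal(loc=-150.0, scale=10.0, size=200)
mean, basis, _ = build_reduced_space(toy_decoys, toy_energies, percentile=10.0)
model = to_model(mean, basis, np.zeros(basis.shape[0]))
```

In this sketch, an optimizer would search over the few reduced coordinates, and `to_model` would rebuild a full model for each energy evaluation. A stricter percentile gives a basis built from fewer but better templates.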
T0555 – PBS linker polypeptide domain of phycobilisome linker protein
We present the numerical results of applying PSO to obtain the tertiary structure of the PBS linker polypeptide domain of phycobilisome linker protein, whose native structure was determined through NMR by Ramelot et al. [37].
Because optimizing the T0555 structure proved difficult, a swarm of 80 particles was used, and the fifth percentile of the best templates was chosen. These choices ensure good convergence and protein refinement while allowing a wide sampling over a search space constructed from good a priori models.
As observed in Fig. 13, the energy converges rapidly within the first 5 iterations and reaches a value of –371.4. Because the majority of the models fluctuate around this energy, the resulting protein has low uncertainty, as shown in Fig. 14.
T0557 – N-terminal domain of putative ATP-dependent DNA helicase RecG-related protein from Nitrosomonas europaea
The performance of the algorithm presented in this paper on CASP9 protein T0557 is shown graphically in Figs. 15 and 16. The native structure of this protein was determined via NMR by Eletsky et al. [38] at the Northeast Structural Genomics Consortium.
A swarm of 80 particles was required for protein T0557. Additionally, the fifth percentile of the best templates was chosen in order to ensure good convergence and good protein refinement.
T0561 – The structural basis for recognition of J-base containing DNA by a novel DNA-binding domain in JBP1
We applied the PSO methodology to CASP9 protein T0561, whose native structure was determined through X-ray diffraction by Heidebrecht et al. [39]. To achieve proper convergence, we used a swarm of 60 particles and the tenth percentile of the best protein decoys (Figs. 17 and 18).
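For orientation, the role of the swarm size can be seen in a generic particle swarm loop. The sketch below is a textbook global-best PSO with inertia and two acceleration constants. It is not the RR-PSO or any other member of the PSO family used in this work, and the function name `pso_minimize`, the parameter values, the box bounds, and the toy quadratic energy are all assumptions made for illustration.

```python
import random

def pso_minimize(energy, dim, n_particles=60, n_iter=100,
                 lower=-5.0, upper=5.0, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Textbook global-best PSO; returns the best point, its energy, and all samples."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lower, upper) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_e = [energy(p) for p in pos]
    g = min(range(n_particles), key=pbest_e.__getitem__)
    gbest, gbest_e = pbest[g][:], pbest_e[g]
    history = []  # every (position, energy) pair visited by the swarm

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia term plus attraction to the personal and global bests.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and clip it to the search box.
                pos[i][d] = min(upper, max(lower, pos[i][d] + vel[i][d]))
            e = energy(pos[i])
            history.append((pos[i][:], e))
            if e < pbest_e[i]:
                pbest[i], pbest_e[i] = pos[i][:], e
                if e < gbest_e:
                    gbest, gbest_e = pos[i][:], e
    return gbest, gbest_e, history

def toy_energy(x):
    """Stand-in for a protein energy function: a simple quadratic bowl."""
    return sum(v * v for v in x)

best, best_e, history = pso_minimize(toy_energy, dim=3, n_particles=60)
```

A larger swarm costs more energy evaluations per iteration but visits more of the reduced space. Keeping `history` rather than only the best point is what allows the sampled models to be analyzed afterwards.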
T0580 – The lactose-specific IIB component domain structure of the phosphoenolpyruvate:carbohydrate phosphotransferase system (PTS) from Streptococcus pneumoniae
CASP9 protein T0580, whose native structure was determined through X-ray diffraction by Cuff, M.E. [40], was also optimized by PSO. In this case, a swarm of 60 particles and the tenth percentile of the best decoys were selected (Figs. 19 and 20).
T0635 – The putative HAD superfamily (subfamily III A) hydrolase from Legionella pneumophila
We present the numerical results of applying PSO to obtain the tertiary structure of the putative HAD superfamily (subfamily III A) hydrolase from Legionella pneumophila, whose native structure was determined through X-ray diffraction by Ramagopal et al. [41].
Because the topology of this protein makes it very difficult for the algorithm, a swarm of 70 particles was used, and the fifth percentile of the best templates was chosen. As for the previous proteins, these choices ensure good convergence and good protein refinement through a wide sampling based on good a priori models.
As observed in Fig. 21, the energy converges rapidly within the first five iterations and reaches a value of –465.5. Because the majority of the models fluctuate around this energy, the resulting protein has low uncertainty, as shown in Fig. 22, where only a small variation is observed in the extremes of the protein, while negligible variations are observed in the central atoms.
T0637 – Crystal structure of the hypothetical protein PA0856 from Pseudomonas aeruginosa
Additionally, the capabilities of PSO were tested on a hypothetical protein listed in the CASP9 experiment, whose structure was reported by Oke et al. [42] at the Scottish Structural Proteomics Facility. PSO was performed using a swarm of 70 particles and the tenth percentile of the best protein decoys submitted to the experiment. The energy obtained was –372.0, with very low variability among the models. Consequently, the uncertainty of the protein is low, and only minor variations are observed in the extremes, the most sensitive part (Figs. 23 and 24).
T0639 – Crystal structure of functionally unknown protein from Neisseria meningitidis MC58
Protein T0639 from the CASP9 experiment is a protein from Neisseria meningitidis MC58 whose native structure was determined via X-ray diffraction by Zhang et al. [43] at the Midwest Center for Structural Genomics. As in the case of T0637, this protein rapidly reaches its minimum energy, at –342.7. The RMSD improvement confirms that the protein is successfully refined through PCA and RR-PSO optimization (Figs. 25 and 26).
T0643 – Crystal structure of the N-terminal domain of DNA-binding protein SATB1 from Homo sapiens
Protein T0643, which corresponds to the N-terminal domain of DNA-binding protein SATB1, was also considered; its native structure was determined through X-ray diffraction by Forouhar et al. [44]. The algorithm performance is presented in Figs. 27 and 28. Figure 27 shows that PSO was capable of optimizing the energy of the protein while sampling the region below –200. This sampling quantifies the structural uncertainty, as shown in Fig. 28.
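The idea of quantifying uncertainty from the models sampled below an energy bound can be sketched as follows. This is a hedged illustration rather than the procedure used to produce the figures: it assumes each sampled model is available as a flat list of coordinates paired with its energy, and it uses the per-coordinate standard deviation as one possible uncertainty measure. The function name `low_energy_region_stats` and the toy numbers are hypothetical.

```python
from statistics import mean, pstdev

def low_energy_region_stats(samples, energy_bound):
    """Summarize the sampled models whose energy lies below `energy_bound`.

    samples : iterable of (coords, energy) pairs, coords being a flat list
    Returns the number of retained models and the per-coordinate mean and
    population standard deviation over the retained models.
    """
    kept = [coords for coords, e in samples if e < energy_bound]
    if not kept:
        raise ValueError("no sampled model lies below the energy bound")
    columns = list(zip(*kept))  # one tuple per coordinate
    means = [mean(c) for c in columns]
    spreads = [pstdev(c) for c in columns]
    return len(kept), means, spreads

# Toy samples (made-up coordinates and energies, for illustration only).
toy_samples = [
    ([0.0, 1.0, 2.0], -210.0),
    ([0.2, 1.0, 2.4], -205.0),
    ([5.0, 5.0, 5.0], -120.0),  # above the bound, so it is discarded
]
n_kept, means, spreads = low_energy_region_stats(toy_samples, energy_bound=-200.0)
```

Coordinates with a small spread are well constrained by the low-energy models, whereas a large spread flags the more uncertain parts of the structure.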
Cite this article
Álvarez, Ó., Fernández-Martínez, J.L., Corbeanu, A.C. et al. Predicting protein tertiary structure and its uncertainty analysis via particle swarm sampling. J Mol Model 25, 79 (2019). https://doi.org/10.1007/s00894-019-3956-0