Predicting protein tertiary structure and its uncertainty analysis via particle swarm sampling

  • Original Paper
  • Journal of Molecular Modeling

Abstract

We discuss the relationship between the problem of protein tertiary structure prediction from the amino acid sequence and the uncertainty analysis. The algorithm presented in this paper belongs to the category of decoy-based modeling, where different known protein models are used to establish a low dimensional space via principal component analysis. The low dimensional space is utilized to perform an energy optimization via a family of very explorative particle swarm optimizers to find the global minimum. The aim of this procedure is to get a representative sample of the nonlinear equivalent region, that is, protein models that have their energy lower than a certain energy bound. The posterior analysis of this family provides very valuable information about the backbone structure of the native conformation and its possible alternate states. This methodology has the advantage of being simple and fast and can help refine the tertiary protein structure. We comprehensively illustrate the performance of our algorithm on one protein from the CASP-9 protein structure prediction experiment. We also provide a theoretical analysis of the energy landscape found in the tertiary structure protein inverse problem, explaining why model reduction techniques (principal component analysis in this case) serve to alleviate the ill-posed character of this high dimensional optimization problem. In addition, we expand the computational benchmark with a summary of other CASP-9 proteins in the Appendix.
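The model-reduction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the decoys are assumed to be already superposed and flattened into coordinate vectors, and the names `build_pca_basis`, `to_reduced`, `to_full` and the random `toy_decoys` array are hypothetical.

```python
import numpy as np

def build_pca_basis(decoys, n_components):
    """Return the mean decoy and the leading principal directions.

    decoys: (n_decoys, 3 * n_atoms) array of flattened coordinates
    (assumed to be superposed beforehand).
    """
    mean = decoys.mean(axis=0)
    centered = decoys - mean
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def to_reduced(model, mean, basis):
    """Project a full coordinate vector onto the PCA coefficients."""
    return basis @ (model - mean)

def to_full(coeffs, mean, basis):
    """Rebuild a full coordinate vector from PCA coefficients."""
    return mean + basis.T @ coeffs

# Toy data: 20 hypothetical decoys of a 10-atom backbone.
rng = np.random.default_rng(0)
toy_decoys = rng.normal(size=(20, 30))
mean, basis = build_pca_basis(toy_decoys, n_components=5)
coeffs = to_reduced(toy_decoys[0], mean, basis)
reconstructed = to_full(coeffs, mean, basis)
```

An optimizer can then search over the few entries of `coeffs` instead of all atomic coordinates, mapping each candidate back with `to_full` before evaluating its energy.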

References

  1. Tyka MD et al (2011) Alternate states of proteins revealed by detailed energy landscape mapping. J Mol Biol 405:607–618

  2. Zhang Y (2008) Progress and challenges in protein structure prediction. Curr Opin Struc Biol 18:342–348

  3. Stoker HS (2015) Organic and biological chemistry. Cengage Learning, Boston

  4. Sander C, Schneider R (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9:56–68

  5. Bowie JU et al. (1991) A method to identify protein sequences that fold into a known three-dimensional structure. Science 253:164–170

  6. Alvarez-Machancoses O et al (2018) Principal component analysis in protein tertiary structure. J Bioinform Comput Biol 16:1850005

  7. Saraswathi S, Fernández-Martínez JL et al. (2012) Fast learning optimized prediction methodology (FLOPRED) for protein secondary structure prediction. J Mol Model 18:4275–4289

  8. Saraswathi S, Fernández-Martínez JL et al. (2013) An aminoacid perspective to secondary structure prediction. J Mol Model 19:4337–4348

  9. Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science 294:93–96

  10. Ramelot TA et al. (2009) Improving NMR protein structure quality by Rosetta refinement: a molecular replacement study. Proteins 75:147–167

  11. Gniewek P et al. (2014) BioShell - threading: a versatile Monte Carlo package for protein threading. BMC Bioinform 22:22

  12. Gniewek P et al. (2012) How noise in force fields can affect the structural refinement of protein models. Proteins: Struct Funct Bioinf 80:335–341

  13. Gront D, Kolinski A (2006) Bioshell - A package of tools for structural biology prediction. Bioinformatics 22:621–622

  14. Gront D, Kolinski A (2008) Utility library for structural bioinformatics. Bioinformatics 24:584–585

  15. Yang Y, Zhou Y (2008) Specific interactions for ab initio folding of protein terminal regions with secondary structures. Proteins 72:793–803

  16. Qiu D et al. (1997) The GB/SA Continuum Model for Solvation. A Fast Analytical Method for the Calculation of Approximate Born Radii. J Phys Chem A 101:3005–3014

  17. Price SL (2008) From crystal structure prediction to polymorph prediction: interpreting the crystal energy landscape. Phys Chem Chem Phys 2008:1996–2009

  18. Goldenberg DP, Creighton TE (2004) Energetics of protein structure and folding. Biopolymers 24:167–182

  19. Fernández-Martínez JL, García-Gonzalo E (2011) Stochastic stability analysis of the linear continuous and discrete PSO models. IEEE Trans Evol Comput 15:405–423

  20. Fernández-Martínez JL, García-Gonzalo E (2012) Stochastic stability and numerical analysis of two novel algorithms of the PSO family: PP-PSO and RR-PSO. Int J Artif Intell Tools 21:1240011

  21. Fernández-Martínez JL et al. (2013) From Bayes to Tarantola: New insights to understand uncertainty in inverse problems. J Appl Geophys 98:62–72

  22. Fernández-Martínez JL et al. (2012) On the topography of the cost functional in linear and nonlinear inverse problems. Geophysics 77:W1–W15

  23. Fernández-Martínez JL et al. (2014) The effect of the noise and Tikhonov’s regularization in inverse problems. Part I: the linear case. J Appl Geophys 108:176–185

  24. Fernández-Martínez JL (2014) The effect of the noise and Tikhonov’s regularization in inverse problems. Part II: the nonlinear case. J Appl Geophys 108:186–193

  25. Zhang Y, Skolnick J (2004) SPICKER: a clustering approach to identify near-native protein folds. J Comp Chem 25:865–871

  26. Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philos Mag 2:559–572

  27. Fernández-Martínez JL et al. (2012) Reservoir characterization and inversion uncertainty via a family of particle swarm optimizers. Geophysics 77–1:1–16

  28. Jolliffe I (2002) Principal component analysis. Springer, New York

  29. Qian B et al. (2004) Improvement of comparative model accuracy by free-energy optimization along principal components of natural structural variation. Proc Natl Acad Sci USA 101:15346–15351

  30. Tarantola A (2005) Inverse problem theory and methods for model parameter estimation. SIAM, Philadelphia

  31. Fernández-Martínez JL (2015) Model reduction and uncertainty analysis in inverse problems. Leading Edge 34:1006–1016

  32. Kennedy J, Eberhart R (1995) A new optimizer using particle swarm theory. Proc Sixth Int Symp Micro Mach Human Sci

  33. Fernández-Martínez JL, García-Gonzalo E (2008) The generalized PSO: a new door to PSO evolution. J Artif Evol Appl: 861275

  34. Fernández-Martínez JL, García-Gonzalo E (2009) The PSO family: deduction, stochastic analysis and comparison. Swarm Intell 3:245–273

  35. Aramini JM et al. (2010) Solution NMR structure of a putative uracil DNA glycosylase from Methanosarcina acetivorans. Northeast structural genomics consortium target MvR76

  36. Fernández-Martínez JL et al. (2012) Stochastic stability and numerical analysis of two novel algorithms of the PSO family: PP-PSO and RR-PSO. Int J Artif Intell Tools 21:1240011

  37. Fernández-Martínez JL, García Gonzalo E (2011) Stochastic stability analysis of the linear continuous and discrete PSO models. IEEE Trans Evol Comput 15:405–423

Acknowledgments

A. K. acknowledges financial support from NSF grant DBI 1661391, NIH grants R01 GM127701 and R01 GM127701-01S1, and from Bridge funding from The Research Institute at Nationwide Children’s Hospital. We also acknowledge Ms. Celia Fernández-Brillet for her help in revising this paper.

Author information

Corresponding author

Correspondence to Juan Luis Fernández-Martínez.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix. Supporting Information

In this section, we aim to expand the paper benchmark by presenting results for an additional set of nine proteins from CASP9. We tested our methodology in order to prove its suitability for protein refinement purposes.

T0551 – X-ray crystal structure of protein SP_0782 (7-79) from Streptococcus pneumoniae. Northeast Structural Genomics Consortium Target SpR104

We present the numerical results of applying PSO to obtain the tertiary structure of protein SP_0782 (7-79) from Streptococcus pneumoniae, whose native structure was determined by X-ray crystallography by Kuzin et al. [36].

Owing to the difficulties encountered in optimizing the T0551 structure, a swarm of 70 particles was used. Additionally, the tenth percentile of the best templates was chosen. These choices ensure good convergence and protein refinement, while also providing a wide sampling over a search space constructed with good a priori models.
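The template selection described above can be illustrated with a short sketch. It assumes that "best" means lowest energy and that the cut is taken with NumPy's default linear-interpolation percentile; the function name `select_best_templates` and the toy arrays are hypothetical, and the source does not state how ties or the ranking criterion are actually handled.

```python
import numpy as np

def select_best_templates(decoys, energies, percentile):
    """Keep the decoys whose energy is at or below the given percentile."""
    threshold = np.percentile(energies, percentile)
    keep = energies <= threshold
    return decoys[keep], energies[keep]

# Toy data: 10 hypothetical decoys with 6 coordinates each.
energies = np.array([-10.0, -50.0, -30.0, -90.0, -70.0,
                     -20.0, -60.0, -40.0, -80.0, -100.0])
decoys = np.arange(10 * 6, dtype=float).reshape(10, 6)

# Tenth percentile, as used for T0551; use percentile=5 for the fifth.
best, best_energies = select_best_templates(decoys, energies, percentile=10)
```

The retained decoys would then be the ones used to build the reduced search space.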

As observed in Fig. 11, the energy converges within the first 20 iterations and reaches a value of –162.3. Because the majority of the models fluctuate around this energy, we obtain a protein with low uncertainty, as shown in Fig. 12.

Fig. 11

T0551 protein. a) Convergence curve. b) Median dispersion curve (%)

Fig. 12

T0551 posterior sampling in the region of energy lower than –200. a) Median protein of the decoys sampled in the region of energy corresponding to the 10th percentile. b) Median protein plus the interquartile range of the coordinates of these decoys. c) Median protein minus the interquartile range of the coordinates of these decoys
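The three panels in this caption (median, median plus interquartile range, median minus interquartile range) can be computed from the sampled models as in the sketch below. It assumes the sampled models are stored as flattened coordinate vectors with one energy per model; `posterior_summary` and the toy values are hypothetical names and data chosen for illustration.

```python
import numpy as np

def posterior_summary(sampled_models, energies, energy_bound):
    """Coordinate-wise median and median +/- IQR of the low-energy models."""
    selected = sampled_models[energies < energy_bound]
    median = np.median(selected, axis=0)
    q75, q25 = np.percentile(selected, [75, 25], axis=0)
    iqr = q75 - q25
    return median, median + iqr, median - iqr

# Toy data: four hypothetical models with two coordinates each.
models = np.array([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0], [100.0, 100.0]])
energies = np.array([-210.0, -220.0, -205.0, -50.0])
median, upper, lower = posterior_summary(models, energies, energy_bound=-200.0)
```

Coordinates with a wide band between `lower` and `upper` are the ones the sampling leaves most uncertain.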

T0555 – PBS linker polypeptide domain of phycobilisome linker protein

We present the numerical results of applying PSO to obtain the tertiary structure of the PBS linker polypeptide domain of phycobilisome linker protein, whose native structure was determined by NMR by Ramelot et al. [37].

Owing to the difficulties encountered in optimizing the T0555 structure, a swarm of 80 particles was used. Additionally, the fifth percentile of the best templates was chosen. These choices ensure good convergence and protein refinement, while also providing a wide sampling over a search space constructed with good a priori models.

As observed in Fig. 13, the energy converges quickly, within the first 5 iterations, and reaches a value of –371.4. Because the majority of the models fluctuate around this energy, we obtain a protein with low uncertainty, as shown in Fig. 14.
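The convergence and sampling behavior described above can be illustrated with a generic swarm loop. The sketch below uses the classic inertia-weight PSO update, not the RR-PSO variant of the paper, together with a toy quadratic energy. The parameter values (`w`, `c1`, `c2`), the function names and the energy bound are all assumptions made for illustration.

```python
import numpy as np

def pso_sample(energy, lower, upper, n_particles, n_iterations,
               energy_bound, seed=0):
    """Minimize `energy` and collect every visited model below the bound."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lower, upper, size=(n_particles, lower.size))
    v = np.zeros_like(x)
    e = np.array([energy(p) for p in x])
    pbest, pbest_e = x.copy(), e.copy()
    g = pbest[np.argmin(pbest_e)].copy()
    history = [pbest_e.min()]  # best energy found so far, per iteration
    samples = [p.copy() for p, ep in zip(x, e) if ep < energy_bound]
    w, c1, c2 = 0.7, 1.5, 1.5  # illustrative values, not from the paper
    for _ in range(n_iterations):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lower, upper)
        e = np.array([energy(p) for p in x])
        improved = e < pbest_e
        pbest[improved], pbest_e[improved] = x[improved], e[improved]
        g = pbest[np.argmin(pbest_e)].copy()
        history.append(pbest_e.min())
        samples.extend(p.copy() for p, ep in zip(x, e) if ep < energy_bound)
    return g, np.array(history), np.array(samples)

def toy_energy(coeffs):
    """Stand-in for a force-field energy evaluated on reduced coordinates."""
    return float(np.sum(coeffs ** 2))

best, history, samples = pso_sample(
    toy_energy, np.full(3, -5.0), np.full(3, 5.0),
    n_particles=20, n_iterations=50, energy_bound=1.0)
```

Here `history` plays the role of a convergence curve, and `samples` is the set of models in the low-energy region that a posterior analysis would summarize.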

Fig. 13

T0555 protein. a) Convergence curve. b) Median dispersion curve (%)

Fig. 14

T0555 posterior sampling in the region of energy lower than –200. a) Median protein of the decoys sampled in the region of energy corresponding to the 5th percentile. b) Median protein plus the interquartile range of the coordinates of these decoys. c) Median protein minus the interquartile range of the coordinates of these decoys

T0557 – N-terminal domain of putative ATP-dependent DNA helicase RecG-related protein from Nitrosomonas europaea

The performance of the algorithm presented in this paper on CASP9 protein T0557 is shown in Figs. 15 and 16. The native structure of this protein was determined by NMR by Eletsky et al. [38] at the Northeast Structural Genomics Consortium.

A swarm of 80 particles was required for protein T0557. Additionally, the fifth percentile of the best templates was chosen in order to ensure good convergence and good protein refinement.

Fig. 15

T0557 protein. a) Convergence curve. b) Median dispersion curve (%)

Fig. 16

T0557 posterior sampling in the region of energy lower than –150. a) Median protein of the decoys sampled in the region of energy corresponding to the 5th percentile. b) Median protein plus the interquartile range of the coordinates of these decoys. c) Median protein minus the interquartile range of the coordinates of these decoys

T0561 – The structural basis for recognition of J-base containing DNA by a novel DNA-binding domain in JBP1

We applied the PSO methodology to CASP9 protein T0561, whose native structure was determined by X-ray diffraction by Heidebrecht et al. [39]. To achieve proper convergence, we used a swarm of 60 particles and the tenth percentile of the best protein decoys (Figs. 17 and 18).

Fig. 17

T0561 protein. a) Convergence curve. b) Median dispersion curve (%)

Fig. 18

T0561 posterior sampling in the region of energy lower than 0. a) Median protein of the decoys sampled in the region of energy corresponding to the 10th percentile. b) Median protein plus the interquartile range of the coordinates of these decoys. c) Median protein minus the interquartile range of the coordinates of these decoys

T0580 – The lactose-specific IIB component domain structure of the phosphoenolpyruvate:carbohydrate phosphotransferase system (PTS) from Streptococcus pneumoniae

CASP9 protein T0580, whose native structure was determined by Cuff, M.E. [40] through X-ray diffraction, was optimized by PSO. A swarm of 60 particles and the tenth percentile of the best decoys were selected (Figs. 19 and 20).

Fig. 19

T0580 protein. a) Convergence curve. b) Median dispersion curve (%)

Fig. 20

T0580 posterior sampling in the region of energy lower than 0. a) Median protein of the decoys sampled in the region of energy corresponding to the 10th percentile. b) Median protein plus the interquartile range of the coordinates of these decoys. c) Median protein minus the interquartile range of the coordinates of these decoys

T0635 – The putative HAD superfamily (subfamily III A) hydrolase from Legionella pneumophila

We present the numerical results of applying PSO to obtain the tertiary structure of the putative HAD superfamily (subfamily III A) hydrolase from Legionella pneumophila, whose native structure was determined by X-ray diffraction by Ramagopal et al. [41].

Since the topology of this protein makes the optimization very difficult, a swarm of 70 particles was used. Additionally, the fifth percentile of the best templates was chosen. As for the previous proteins, these choices ensure good convergence and good protein refinement through a wide sampling over good a priori models.

As observed in Fig. 21, the energy converges quickly, within the first five iterations, and reaches a value of –465.5. Because the majority of the models fluctuate around this energy, we obtain a protein with low uncertainty, as shown in Fig. 22, where only a small variation is observed at the extremes of the protein, while negligible variations are observed in the central atoms.

Fig. 21

T0635 protein. a) Convergence curve. b) Median dispersion curve (%)

Fig. 22

T0635 posterior sampling in the region of energy lower than 0. a) Median protein of the decoys sampled. b) Median protein plus the interquartile range of the coordinates of these decoys. c) Median protein minus the interquartile range of the coordinates of these decoys

T0637 – Crystal structure of the hypothetical protein PA0856 from Pseudomonas aeruginosa.

Additionally, the capabilities of PSO were tested on a hypothetical protein listed in the CASP9 experiment, reported by Oke et al. [42] at the Scottish Structural Proteomics Facility. The PSO was run with a swarm of 70 particles and the tenth percentile of the best protein decoys submitted to the experiment. The energy obtained was –372.0, with very low variability among the models. Consequently, the uncertainty of the protein is low and only minor variations are observed at the extremes, the most sensitive part (Figs. 23 and 24).

Fig. 23

T0637 protein. a) Convergence curve. b) Median dispersion curve (%)

Fig. 24

T0637 posterior sampling in the region of energy lower than 0. a) Median protein of the decoys sampled. b) Median protein plus the interquartile range of the coordinates of these decoys. c) Median protein minus the interquartile range of the coordinates of these decoys

T0639 – Crystal structure of functionally unknown protein from Neisseria meningitidis MC58.

Protein T0639 from the CASP9 experiment is a protein from Neisseria meningitidis MC58 whose native structure was determined via X-ray diffraction by Zhang et al. [43] from the Midwest Center for Structural Genomics. As in the case of T0637, this protein rapidly reaches its minimum energy at –342.7. The RMSD improvement confirms that the protein is successfully refined through PCA and RR-PSO optimization (Figs. 25 and 26).
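For readers who want to reproduce an RMSD comparison of this kind, the sketch below computes the RMSD between a model and a native structure after optimal rigid superposition with the Kabsch algorithm. The text does not state which RMSD definition or atom selection was used, so the superposition step, the function name `kabsch_rmsd` and the random toy coordinates are assumptions.

```python
import numpy as np

def kabsch_rmsd(p, q):
    """RMSD between two (n_atoms, 3) arrays after optimal superposition."""
    p_c = p - p.mean(axis=0)
    q_c = q - q.mean(axis=0)
    # Covariance matrix and its SVD give the optimal rotation of p onto q.
    u, _, vt = np.linalg.svd(p_c.T @ q_c)
    # Flip the last axis if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    p_rot = p_c @ rotation.T
    return float(np.sqrt(np.mean(np.sum((p_rot - q_c) ** 2, axis=1))))

# Toy check: a rotated and translated copy should have RMSD close to zero.
rng = np.random.default_rng(1)
native = rng.normal(size=(12, 3))
theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta), np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
moved = native @ rot_z.T + np.array([1.0, -2.0, 0.5])
rmsd_moved = kabsch_rmsd(moved, native)
rmsd_noisy = kabsch_rmsd(native + 0.1 * rng.normal(size=native.shape), native)
```

An improvement would correspond to the refined model having a smaller `kabsch_rmsd` to the native structure than the starting decoy.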

Fig. 25

T0639 protein. a) Convergence curve. b) Median dispersion curve (%)

Fig. 26

T0639 posterior sampling in the region of energy lower than –300. a) Median protein of the decoys sampled. b) Median protein plus the interquartile range of the coordinates of these decoys. c) Median protein minus the interquartile range of the coordinates of these decoys

T0643 – Crystal structure of the N-terminal domain of DNA-binding protein SATB1 from Homo sapiens.

Protein T0643 was also considered, which corresponds to the N-terminal domain of DNA-binding protein SATB1, whose native structure was obtained through X-ray diffraction by Forouhar et al. [44]. The algorithm performance is presented in Figs. 27 and 28. Figure 27 shows how PSO was capable of optimizing the energy of the protein while carrying out the sampling in the region below –200. The sampling in the region below –200 quantifies the structure uncertainty as shown in Fig. 28.

Fig. 27

T0643 protein. a) Convergence curve. b) Median dispersion curve (%)

Fig. 28

T0643 posterior sampling in the region of energy lower than –150. a) Median protein of the decoys sampled. b) Median protein plus the interquartile range of the coordinates of these decoys. c) Median protein minus the interquartile range of the coordinates of these decoys

Cite this article

Álvarez, Ó., Fernández-Martínez, J.L., Corbeanu, A.C. et al. Predicting protein tertiary structure and its uncertainty analysis via particle swarm sampling. J Mol Model 25, 79 (2019). https://doi.org/10.1007/s00894-019-3956-0

  • DOI: https://doi.org/10.1007/s00894-019-3956-0
