Chapter Six - Scientific Benchmarks for Guiding Macromolecular Energy Function Improvement
Introduction
Scientific benchmarks are essential for the development and parameterization of molecular modeling energy functions. Widely used molecular mechanics energy functions such as Amber and OPLS were originally parameterized with experimental and quantum chemistry data from small molecules and benchmarked against experimental observables such as intermolecular energies in the gas phase, solution phase densities, and heats of vaporization (Jorgensen et al., 1996, Weiner et al., 1984). More recently, thermodynamic measurements and high-resolution structures of macromolecules have provided a valuable testing ground for energy function development. Commonly used scientific tests include discriminating the ground state conformation of a macromolecule from higher energy conformations (Novotný et al., 1984, Park and Levitt, 1996, Simons et al., 1999), and predicting amino acid sidechain conformations (Bower et al., 1997, Jacobson et al., 2002) and free energy changes associated with protein mutations (Gilis and Rooman, 1997, Guerois et al., 2002, Potapov et al., 2009).
Many studies have focused on optimizing an energy function for a particular problem in macromolecular modeling. For instance, the FoldX energy function was empirically parameterized for predicting changes to the free energy of a protein when it is mutated (Guerois et al., 2002). Often, these types of energy functions are well suited only to the task they have been trained for. Kellogg, Leaver-Fay, and Baker (2011) showed that an energy function explicitly trained to predict energies of mutation did not produce native-like sequences when redesigning proteins. For many projects, it is advantageous to have a single energy function that can be used for diverse modeling tasks. For example, protocols in the molecular modeling program Rosetta for ligand docking (Meiler & Baker, 2003), protein design (Kuhlman et al., 2003), and loop modeling (Wang, Bradley, & Baker, 2007) share a common energy function, which allowed Murphy, Bolduc, Gallaher, Stoddard, and Baker (2009) to combine them to shift an enzyme's substrate specificity.
Sharing a single energy function between modeling applications presents both opportunities and challenges. Researchers applying the energy function to new tasks sometimes uncover deficiencies in it. The opportunity is that correcting these deficiencies for the new tasks will also improve performance on the older tasks; after all, nature uses only one energy function. Sometimes, however, modifications to the energy function that improve its performance at one task degrade its performance at others. The challenges are then to discriminate beneficial from deleterious modifications and to reconcile task-specific objectives.
To address these challenges, we have developed three tools based on benchmarking Rosetta against macromolecular data. The first tool (Section 3), a suite we call “feature analysis,” can be used to contrast ensembles of structural details from structures in the PDB and from structures generated by Rosetta. The second tool (Section 4), a program we call “optE,” relies on fast, small-scale benchmarks to train the weights in the energy function. These two tools can help identify and fix flaws in the energy function, facilitating the process of integrating a proposed modification. We follow (Section 5) with a curated set of large-scale benchmarks meant to provide sufficient coverage of Rosetta's applications. The use of these benchmarks will provide evidence that a proposed energy function modification should be widely adopted. To conclude (Section 6), we demonstrate our tools and benchmarks by evaluating three incremental modifications to the Rosetta energy function.
Alongside this chapter, we have created an online appendix, which documents usage of the tools, input files, instructions for running the benchmarks, and current testing results: http://rosettatests.graylab.jhu.edu/guided_energy_function_improvement.
Section snippets
Energy Function Model
The Rosetta energy function is a linear combination of terms that model interactions between atoms, solvation effects, and torsion energies. More specifically, Score12 (Rohl, Strauss, Misura, & Baker, 2004), the default fullatom energy function in Rosetta, consists of a Lennard–Jones term, an implicit solvation term (Lazaridis & Karplus, 1999), an orientation-dependent H-bond term (Kortemme, Morozov, & Baker, 2003), sidechain and backbone torsion potentials derived from the PDB, a short-ranged
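As a minimal illustration of the linear-combination form described in this section, the following Python sketch computes a total energy as a weighted sum of per-term energies. The term names, weights, and values are hypothetical placeholders chosen for the example; they are not Score12 parameters, and the function is not part of Rosetta.

```python
from typing import Dict


def total_energy(weights: Dict[str, float], terms: Dict[str, float]) -> float:
    """Return the weighted sum of per-term energies: E = sum_i w_i * E_i."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"no energy value for terms: {sorted(missing)}")
    return sum(w * terms[name] for name, w in weights.items())


# Hypothetical term names and numbers, for illustration only.
example_weights = {"lj": 0.8, "solvation": 0.6, "hbond": 1.2}
example_terms = {"lj": -10.0, "solvation": 5.0, "hbond": -2.5}
example_total = total_energy(example_weights, example_terms)
```

Keeping the weights separate from the term values mirrors the idea that the weights can be refit without changing how each term is evaluated.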
Feature Analysis
We aim to facilitate the analysis of distributions of measurable properties of molecular conformations, which we call “feature analysis.” By formalizing the analysis process, we are able to create a suite of tools and benchmarks that unify the collection, visualization, and comparison of feature distributions. After motivating our work, we describe the components (Section 3.1) and illustrate how they can be integrated into a workflow (Section 3.2) by investigating the distribution of the
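One simple way to contrast a feature distribution from native structures with one from generated structures is to histogram both samples on shared bins and measure how far apart the histograms are. The Python sketch below does this with the total variation distance. The choice of statistic, the binning scheme, and the function names are assumptions made for illustration; the excerpt does not specify which comparisons the feature analysis suite performs.

```python
from typing import List, Sequence


def histogram(values: Sequence[float], lo: float, hi: float, n_bins: int) -> List[float]:
    """Normalized histogram of values on n_bins equal-width bins over [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v <= hi:
            # Values equal to hi are placed in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside [lo, hi]")
    return [c / total for c in counts]


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two histograms on identical bins."""
    if len(p) != len(q):
        raise ValueError("histograms must use the same bins")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

A distance of 0 means the two binned distributions agree exactly, and a distance of 1 means they share no occupied bins; values in between give a rough measure of how much a feature in generated structures departs from the native reference.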
Maximum Likelihood Parameter Estimation with optE
Recall that the Rosetta energy function is a weighted linear combination of energy terms that capture different aspects of molecular structure, as defined in Eq. (6.1). The weights, w, balance the contribution of each term to give the overall energy. Because the weights often need adjusting after modifying an energy term, we have developed a tool called “optE” to facilitate fitting them against scientific benchmarks. The benchmarks are small, tractable tests of Rosetta's ability to recapitulate
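To make the idea of fitting weights by maximum likelihood concrete, the sketch below fits the weights of a linear energy function so that a native conformation becomes more probable than a set of decoys. It assumes a Boltzmann-like probability, P(native) proportional to exp(-E), and uses plain gradient descent. The likelihood form, the optimizer, and all names and numbers are illustrative assumptions; they are not taken from optE, whose objective functions are not given in this excerpt.

```python
import math
from typing import List, Sequence


def energy(w: Sequence[float], terms: Sequence[float]) -> float:
    """Weighted linear combination of energy terms."""
    return sum(wi * ti for wi, ti in zip(w, terms))


def neg_log_likelihood(w, native, decoys) -> float:
    """-ln P(native), with P proportional to exp(-E) over native plus decoys."""
    energies = [energy(w, native)] + [energy(w, d) for d in decoys]
    e_min = min(energies)  # shift energies for numerical stability
    log_z = -e_min + math.log(sum(math.exp(-(e - e_min)) for e in energies))
    return energies[0] + log_z


def gradient(w, native, decoys) -> List[float]:
    """Gradient of -ln P(native): native terms minus Boltzmann-averaged terms."""
    conformations = [native] + list(decoys)
    energies = [energy(w, c) for c in conformations]
    e_min = min(energies)
    boltz = [math.exp(-(e - e_min)) for e in energies]
    z = sum(boltz)
    probs = [b / z for b in boltz]
    return [
        native[k] - sum(p * c[k] for p, c in zip(probs, conformations))
        for k in range(len(w))
    ]


def fit_weights(native, decoys, w0, lr=0.1, steps=200) -> List[float]:
    """Gradient descent on -ln P(native), starting from weights w0."""
    w = list(w0)
    for _ in range(steps):
        g = gradient(w, native, decoys)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


# Hypothetical per-term energies for one native and two decoys.
native_terms = [-1.0, 0.5]
decoy_terms = [[0.0, 0.0], [-0.5, 1.0]]
initial_weights = [1.0, 1.0]
fitted_weights = fit_weights(native_terms, decoy_terms, initial_weights)
```

Because the toy data are separable, the likelihood keeps improving as the weights grow, so a practical fit would need a fixed step budget (as here), regularization, or additional benchmarks to constrain the weights.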
Large-Scale Benchmarks
Scientific benchmarking allows energy function comparison. The tests most pertinent to the Rosetta community often aim toward recapitulating observations from crystal structures. In this section, we describe a curated set of previously published benchmarks, which together provide a comprehensive view of an energy function's strengths and weaknesses. We continually test the benchmarks on the RosettaTests server to allow us to immediately detect changes to Rosetta that degrade its overall
Three Proposed Changes to the Rosetta Energy Function
In this final section, we describe three changes to Rosetta's energy function. After describing each change and its rationale, we present the results of the benchmarks described above.
Conclusion
We have described three tools that can be used to evaluate and improve macromolecular energy functions. Inaccuracies in the energy function can be identified by comparing features from crystal structures and computationally generated structures. New or reparameterized energy terms can be rapidly tested with optE to determine if the change improves structure prediction and sequence design. When a new term is ready to be rigorously tested, we can test for unintended changes to feature
Acknowledgments
Support for A. L. F., M. J. O., and B. K. came from GM073151 and GM073960. Support for J. S. R. came from NIH R01 GM073930. Thanks to Steven Combs for bringing the bicubic-spline implementation to Rosetta.
References (59)
- et al. Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: A new homology modeling tool. Journal of Molecular Biology (1997)
- et al. High-resolution structural and thermodynamic analysis of extreme stabilization of human procarboxypeptidase by computational protein design. Journal of Molecular Biology (2007)
- et al. High-resolution structural validation of the computational redesign of human U1A protein. Structure (2006)
- Rotamer libraries in the 21st century. Current Opinion in Structural Biology (2002)
- et al. Backbone dependent rotamer library for proteins: Application to side chain prediction. Journal of Molecular Biology (1993)
- et al. Predicting protein stability changes upon mutation using database-derived potentials: Solvent accessibility determines the importance of local versus non-local interactions along the sequence. Journal of Molecular Biology (1997)
- et al. Predicting changes in the stability of proteins and protein complexes: A study of more than 1000 mutations. Journal of Molecular Biology (2002)
- et al. An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. Journal of Molecular Biology (2003)
- et al. ROSETTA3: An object-oriented software suite for the simulation and design of macromolecules. Methods in Enzymology (2011)
- et al. Potential functions for hydrogen bonds in protein structure prediction and design. Advances in Protein Chemistry (2005)
- An analysis of incorrectly folded protein models. Implications for structure predictions. Journal of Molecular Biology
- Energy functions that discriminate X-ray and near native folds from well-constructed decoys. Journal of Molecular Biology
- Protein sidechain conformer prediction: A test of the energy function. Folding and Design
- Tertiary templates for proteins. Use of packing criteria in the enumeration of allowed sequences for different structural classes. Journal of Molecular Biology
- Protein structure prediction using Rosetta. Methods in Enzymology
- A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions. Structure
- Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and bayesian scoring functions. Journal of Molecular Biology
- Calculation of conformational ensembles from potentials of mean force: An approach to the knowledge-based prediction of local structures in globular proteins. Journal of Molecular Biology
- Protein-protein docking with backbone flexibility. Journal of Molecular Biology
- Asparagine and glutamine: Using hydrogen atom contacts in the choice of side-chain amide orientation. Journal of Molecular Biology
- MolProbity: All-atom structure validation for macromolecular crystallography. Acta Crystallographica. Section D: Biological Crystallography
- Sodock: Swarm optimization for highly flexible protein-ligand docking. Journal of Computational Chemistry
- Atomic accuracy in predicting and designing noncanonical RNA structure. Nature Methods
- Emergence of protein fold families through rational design. PLoS Computational Biology
- C-H⋯O hydrogen bonds in β-sheets. Acta Crystallographica. Section D: Biological Crystallography
- RosettaScripts: A scripting language interface to the Rosetta macromolecular modeling suite. PLoS One
- Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS One
- Computational protein design with explicit consideration of surface hydrophobic patches. Proteins
- Force field validation using protein side chain prediction. The Journal of Physical Chemistry B
1. These authors contributed equally to this work.