Journal of Molecular Biology
Volume 283, Issue 2, 23 October 1998, Pages 507-526

Regular article
Protein structure prediction by threading. Why it works and why it does not

https://doi.org/10.1006/jmbi.1998.2092

Abstract

We developed a novel Monte Carlo threading algorithm which allows gaps and insertions both in the template structure and in the threaded sequence. The algorithm is able to find the optimal sequence-structure alignment and to sample suboptimal alignments. Using our algorithm we performed sequence-structure alignments for a number of examples for three protein folds (ubiquitin, immunoglobulin and globin) using both an “ideal” set of potentials (optimized to provide the best Z-score for a given protein) and more realistic knowledge-based potentials. Two physically different scenarios emerged. If a template structure is similar to the native one (within 2 Å RMS), then (i) the optimal threading alignment is correct and robust with respect to deviations of the potential from the “ideal” one; (ii) suboptimal alignments are very similar to the optimal one; (iii) as the Monte Carlo temperature decreases, a sharp cooperative transition to the optimal alignment is observed. In contrast, if the template structure is only moderately close to the native structure (RMS greater than 3.5 Å), then (i) the optimal alignment changes dramatically when an “ideal” potential is substituted by the real one; (ii) the structures of suboptimal alignments are very different from the optimal one, reducing the reliability of the alignment; (iii) the transition to the apparently optimal alignment is non-cooperative. In the intermediate cases, when the RMS between the template and the native conformations is in the range between 2 Å and 3.5 Å, the success of threading alignment may depend on the quality of the potentials used.

These results are rationalized in terms of a threading free energy landscape. Possible ways to overcome the fundamental limitations of threading are discussed briefly.

Introduction

The problem of predicting protein conformation from sequences is of great importance and has drawn a lot of attention recently (see e.g. Moult et al 1997, Shakhnovich 1997a, Finkelstein 1997, Jones 1997, Levitt 1997) with hundreds of papers from dozens of groups.

A most desirable solution to the problem is to find a model and an algorithm that simulate folding of a protein in a way that mimics natural protein folding and converges to the native conformation. While some success along these lines has been documented (Kolinski & Skolnick, 1994), this approach encounters a number of serious technical difficulties, making ab initio structure prediction hardly feasible now and perhaps in the foreseeable future (Finkelstein 1997, Ortiz et al 1998, Mirny and Shakhnovich 1996). The main reason for such a conclusion was discussed by Shakhnovich 1997a, Finkelstein 1997, Mirny and Shakhnovich 1996: a “good” folding model must be detailed enough to reproduce energetics faithfully and yet simple enough to be computationally feasible, a blend that has not been reached yet.

The energetics requirement is a very important one as far as folding is concerned: the energy function must be precise enough to single out the unique native structure as the global energy minimum among an astronomically large number of decoys, some of which have very low energy (Shakhnovich, 1994). The precision of potentials required to achieve this goal was analyzed by Bryngelson 1994, Pande et al 1995 for lattice model chains and by Mirny & Shakhnovich (1996) for real proteins. The analysis by Mirny & Shakhnovich (1996) suggests that at the level of a simple two-body approximation of energetics and structure-less amino acids there may not exist any potential which is able to fold real proteins into their native conformations. Adding more details into the model is probably the way to go. However, this complicates the search in conformational space, making computations far more demanding. Thus, ab initio folding success is contingent on finding a safe pathway between the Scylla of incorrect energetics and the Charybdis of a too complicated and thus computationally infeasible model.

The complications inherent in ab initio folding were realized early by a number of workers in the field, and alternative approaches were suggested; the most notable of them is threading (Finkelstein and Reva 1991, Finkelstein 1997). The key idea of the threading method is to decrease dramatically the number of decoys. This is achieved by constraining all protein conformations to a smaller subset of conformations obtained by threading through known protein structures that serve as a scaffold for the protein sequence in question and finding the energetically optimal alignment of the sequence to the scaffold structure. From the physical point of view the threading problem is somewhat equivalent to folding, because it also requires searching over a large set of possible alignments for the one that delivers minimum “energy”. It was shown (Lathrop, 1994) that such a search is an NP-complete problem (i.e. that there is an apparent “Levinthal” paradox in threading). As in folding, the search in threading is biased by the energy function, so that the related key issue is the precision of the energy function. The rationale for using threading rather than folding is the hope that a less precise energy function will suffice for the search over a more constrained conformational set: in this case the native state should be distinguished as having the lowest energy among the smaller number of alternatives. However, such simplification of the conformational space comes at a serious price. The reason is that the native structure itself may not belong to the constrained conformational set! In this case, threading seeks an approximate solution, i.e. the one that is closest to the native state in the conformational set of alignments. However, if this best solution is relatively distant structurally from the native state, its energy may be considerably higher than the energy of the native state (even with the “ideal” potential that strongly favors the native state).
This factor clearly may decrease the energy gap, offsetting the gain achieved due to the restriction of conformational space. Obviously, the conformational space restriction, which is the basis of the threading approach, is not an “innocent” approximation that is guaranteed to work almost by definition. Clearly this issue requires a detailed study that addresses the gains and losses made by threading approximations and adds to our intuition about which factors are more important for particular models, and about when we should expect success in threading simulations and when we should not.

As pointed out above, threading is very much like folding in terms of key questions and difficulties. In folding one asks basically two questions: are the energy functions correct, and is the conformational search efficient enough to find the global minimum? An important advance in protein folding theory, which started from the seminal work of Go (Taketomi et al., 1975), is the understanding that those two questions can be studied separately. The first approach, proposed by Go (Taketomi et al., 1975), was to design an energy function that favors the native contacts and disfavors non-native ones. Such an energy function gives rise to fast folding (Gutin et al., 1996). However, the artificial penalties imposed on non-native interactions make the model somewhat unphysical for studying the physical principles of folding, because in real life strong non-native contacts cannot be excluded a priori. In fact they occur in some proteins (Lacroix et al., 1997). A more physical model for folding is based on sequence design, which generates, for any given potential function, special sequences for which the native structure is guaranteed to be the global minimum. Then the same potential is used for folding as the one used to design sequences (Shakhnovich 1994, Shakhnovich et al 1996a). In this case folding simulations quickly converge to the native state (Gutin et al 1996, Shakhnovich 1994). The properties of the energy landscape and the dynamics that lead to the native state can be studied in detail, with implications for folding and evolution of real proteins (Shakhnovich et al., 1996a). An approach that is very similar in spirit is to design a potential which provides low energy to a natural sequence in its native structure (Goldstein et al 1992, Mirny and Shakhnovich 1996, Koretke et al 1996, Hao and Scheraga 1996, Ortiz et al 1998).
While this energy function may not be transferable to other proteins (Mirny & Shakhnovich, 1996), it serves its purpose by providing a free energy landscape with a large gap and hence reasonably fast folding. One conclusion from the analysis carried out for a number of folding models is that Monte Carlo (MC) simulation represents a powerful search strategy that is very efficient in finding the native state on a physically reasonable landscape.

Here we take a similar systematic approach to study threading. First we develop and present a Monte Carlo threading algorithm, which allows gaps and insertions both in structure and in sequence. The advantage of the Monte Carlo approach is that it converges to the Boltzmann distribution. This feature makes this method a valuable tool to map and characterize a free energy landscape and outline the physical requirements of convergence to the global minimum solution.
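The convergence to the Boltzmann distribution mentioned above can be illustrated with a minimal Metropolis sketch. This is not the threading move set of this work: the states are indices into a toy list of energies, the proposal is a uniform random jump, and the names `metropolis_accept` and `sample` are hypothetical and introduced only for illustration.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng=random):
    """Metropolis criterion: always accept downhill moves,
    accept uphill moves with probability exp(-delta_e / temperature)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def sample(energies, temperature, n_steps, seed=0):
    """Random walk over the indices of a toy energy list (an assumption,
    standing in for the space of alignments). Returns visit counts."""
    rng = random.Random(seed)
    state = rng.randrange(len(energies))
    counts = [0] * len(energies)
    for _ in range(n_steps):
        # Symmetric proposal: pick any state uniformly at random.
        proposal = rng.randrange(len(energies))
        delta_e = energies[proposal] - energies[state]
        if metropolis_accept(delta_e, temperature, rng):
            state = proposal
        counts[state] += 1
    return counts


if __name__ == "__main__":
    # Visit frequencies should approach exp(-E/T) / Z for long runs.
    print(sample([0.0, 1.0, 2.0], temperature=0.5, n_steps=20000))
```

Lowering `temperature` concentrates the visits on the lowest-energy state, which is the sense in which a sampler of this kind can be used both to search for a minimum and to characterize the landscape around it.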

To test and rationalize the MC threading method we apply it to a number of problems of increasing complexity. The reason behind this “gradual” approach is the need to differentiate between limitations intrinsic to the threading approach, as suggested above, and the ones that originate from uncertainties in potential functions and the form of the Hamiltonian (scoring function) used.

Following this program, first we use our algorithm for structure-structure alignment. It turns out that in the framework of our approach the structure-structure alignment is a counterpart of the Go model studied in folding: it provides the “ideal” potential function in which only the native interactions are favorable. This simple implementation of the method makes it straightforward to compare it with existing heuristic approaches such as the structure alignment algorithm Dali (Holm & Sander, 1993) used to build the FSSP database. The comparison shows that the MC procedure proposed in this work gives a better-scoring alignment than Dali (with the same scoring function as used in Dali).

Next, we turn to a more realistic two-body energy function where the interaction energy depends on the amino acid types and the distances between them rather than on their location in the native structure. We design an “ideal” parameter set for this Hamiltonian (scoring function) and study how the approximate character of the potential function affects the results of threading at various degrees of similarity between the template and the native conformations.
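As a rough illustration of a two-body scoring function of this kind, the sketch below sums pair energies over aligned residue pairs whose template positions lie within a distance cutoff. The contact definition, the 8 Å default cutoff, the one-point-per-residue coordinates, and the function name `threading_energy` are all assumptions made for the example and are not taken from this work; the actual functional form and parameters are those described in Methods.

```python
import math


def threading_energy(sequence, pointer, coords, pair_energy, cutoff=8.0):
    """Toy two-body threading energy (an illustrative assumption).

    sequence    : string of residue types of the threaded sequence
    pointer     : pointer[i] is the 1-based template position aligned
                  with residue i, or 0 if residue i is unaligned
    coords      : list of (x, y, z) tuples, one per template position
    pair_energy : dict mapping a sorted pair of residue types to an energy
    cutoff      : distance below which two positions count as a contact
    """
    total = 0.0
    n = len(sequence)
    for i in range(n):
        if pointer[i] == 0:
            continue  # unaligned residues contribute nothing here
        for k in range(i + 1, n):
            if pointer[k] == 0:
                continue
            a = coords[pointer[i] - 1]
            b = coords[pointer[k] - 1]
            if math.dist(a, b) < cutoff:
                key = tuple(sorted((sequence[i], sequence[k])))
                total += pair_energy[key]
    return total


if __name__ == "__main__":
    coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
    energies = {("A", "B"): -1.0}
    print(threading_energy("AB", [1, 2], coords, energies))
```

The point of the sketch is that the energy depends only on which residue types end up at which template positions, so changing the alignment (the `pointer`) changes the energy even though the template coordinates stay fixed.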

Moving closer to the realm of structure prediction, we explore the accuracy of threading, using one of the knowledge-based potentials that are currently available in the literature (Miyazawa & Jernigan, 1996).

In order to make our conclusions significant and general we carried out the analysis for three fold classes: α/β (ubiquitin and its structural homologues), all-β (class I immunoglobulin fold) and all-α (globin fold). The results are consistent between the fold classes studied. This allows us to arrive at quantitative conclusions concerning the degree of similarity between the native structure and the template that is required for successful sequence-structure alignment.

In what follows we provide a detailed discussion of the ubiquitin superfamily, followed by the data (Table 4, Table 5) with comments for the immunoglobulin and globin folds.

MC threading

To sample possible sequence-structure alignments and search for the alignment with the minimal energy we use the Monte Carlo (MC) procedure. The power of the MC procedure is that it allows us to find a global minimum on a variety of rough landscapes (Allen & Tildesley, 1987). In the search for a minimum, it samples possible alignments and allows us to study statistical properties of the energy landscape. This made the MC procedure extremely useful in the study of various disordered physical

Testing MC search strategy with an “ideal” potential

Our first goal is to evaluate the proposed MC threading as a search strategy for threading as well as to probe a possible alignment energy landscape. The results of threading, however, always depend on both the potential and the search strategy. In order to test the search strategy and eliminate the problem of inaccurate potential we use an “ideal” potential.

“Ideal” potential is the one which guarantees the lowest energy to the native conformation. Using dRMS (distance RMS, see Methods) as the

Discussion

Here we presented a systematic approach to the problem of protein structure prediction by threading. As in folding, the problem of threading has two components: (i) a search strategy, which is able to find the optimal alignment of sequence and a structure; and (ii) a potential, which provides the lowest energy to the native and similar structures of a protein. First, we developed a Monte Carlo algorithm to search through the space of alignments for the optimal one. To test the algorithm we

Alignment representation

An alignment between two proteins of length I and J is represented by a matrix $A_{ij}$, where $i = 1,\dots,I$ and $j = 1,\dots,J$:

$$A_{ij} = \begin{cases} 1 & \text{if } i \text{ is aligned with } j \\ 0 & \text{otherwise} \end{cases}$$

Another way of presenting an alignment is by a pointer $p_i$:

$$p_i = \begin{cases} j & \text{if } i \text{ is aligned with } j \\ 0 & \text{if } i \text{ is not aligned to any residue} \end{cases}$$

In this study we do not allow double matches (i.e. $\sum_{i=1}^{I} A_{ij} \leqslant 1$). The reverse of any fragment in the alignment is also forbidden, i.e. if $A_{ij} = 1$, then for any $i' > i$ and $j' < j$, $A_{i'j'} = 0$. (In general, reverse of protein
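The two representations described in this section, the matrix A and the pointer p, and the two constraints (no double matches, no reversed fragments) can be sketched as follows. The function names are hypothetical; positions j are 1-based and 0 marks an unaligned residue, following the definitions in the text.

```python
def pointer_to_matrix(pointer, n_cols):
    """Build the I x J 0/1 matrix A from a pointer p,
    where p[i] = j (1-based) or 0 if residue i is unaligned."""
    matrix = [[0] * n_cols for _ in pointer]
    for i, j in enumerate(pointer):
        if j:
            matrix[i][j - 1] = 1
    return matrix


def matrix_to_pointer(matrix):
    """Recover the pointer p from A, assuming at most one 1 per row."""
    return [row.index(1) + 1 if 1 in row else 0 for row in matrix]


def is_valid(pointer):
    """Check both constraints at once: the aligned j values must be
    strictly increasing with i, which rules out double matches
    (repeated j) and reversed fragments (decreasing j)."""
    aligned = [j for j in pointer if j]
    return all(a < b for a, b in zip(aligned, aligned[1:]))


if __name__ == "__main__":
    p = [1, 0, 3, 4]
    a = pointer_to_matrix(p, n_cols=5)
    print(a)
    print(matrix_to_pointer(a) == p, is_valid(p))
```

The pointer form is the more compact of the two for a search procedure, since a single list of length I is enough to describe an alignment and the validity check reduces to a monotonicity test.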

Acknowledgements

We are grateful to Victor Abkevich and Cecilia Clementi for fruitful discussions. This work was supported by NIH grant GM52126.

References (60)

  • R. Lathrop et al.

    Global optimum protein threading with gapped alignment and empirical pair score functions

    J. Mol. Biol.

    (1996)
  • A. Marchler-Bauer et al.

    A measure of success in fold recognition

    Trends Biochem. Sci.

    (1997)
  • L. Mirny et al.

    How to derive a protein folding potential? A new approach to an old problem

    J. Mol. Biol.

    (1996)
  • L. Mirny et al.

    Universality and diversity of the protein folding scenarios: a comprehensive analysis with the aid of a lattice model

    Fold. Des.

    (1996)
  • S. Miyazawa et al.

    Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading

    J. Mol. Biol.

    (1996)
  • S. Needleman et al.

    A general method applicable to the search for similarities in the amino acid sequence of two proteins

    J. Mol. Biol.

    (1970)
  • V. Pande et al.

    On the theory of folding kinetics for short proteins

    Fold. Des.

    (1997)
  • B. Rost et al.

    Protein fold recognition by prediction-based threading

    J. Mol. Biol.

    (1997)
  • R. Russell et al.

    Protein fold recognition by mapping predicted secondary structures

    J. Mol. Biol.

    (1996)
  • E. Shakhnovich

    Theoretical studies of protein-folding thermodynamics and kinetics

    Curr. Opin. Struct. Biol.

    (1997)
  • E. Shakhnovich et al.

    Influence of point mutations on protein structure: probability of a neutral mutation

    J. Theoret. Biol.

    (1991)
  • D. Shortle

    Structure prediction: folding proteins by pattern recognition

    Curr. Biol.

    (1997)
  • M. Sippl

    Knowledge-based potentials for proteins

    Curr. Opin. Struct. Biol.

    (1995)
  • T. Smith et al.

    Identification of common molecular subsequences

    J. Mol. Biol.

    (1981)
  • V. Abkevich et al.

    How the first biopolymers could have evolved

    Proc. Natl Acad. Sci. USA

    (1996)
  • M. Allen et al.

    Computer Simulation of Liquids

    (1987)
  • G. Berriz et al.

    Cooperativity and stability in a Langevin model of proteinlike folding

    J. Chem. Phys.

    (1997)
  • K. Binder

    Monte Carlo Methods in Statistical Physics

    (1986)
  • K. Binder

    The Monte Carlo Method in Condensed Matter Physics

    (1995)

    Edited by F. Cohen