Protein structure comparison: implications for the nature of ‘fold space’, and structure and function prediction

https://doi.org/10.1016/j.sbi.2006.04.007

The identification of geometric relationships between protein structures offers a powerful approach to predicting the structure and function of proteins. Methods to detect such relationships range from human pattern recognition to a variety of mathematical algorithms. A number of schemes for the classification of protein structure have found widespread use and these implicitly assume the organization of protein structure space into discrete categories. Recently, an alternative view has emerged in which protein fold space is seen as continuous and multidimensional. Significant relationships have been observed between proteins that belong to what have been termed different ‘folds’. There has been progress in the use of these relationships in the prediction of protein structure and function.

Introduction

Proteins have complex three-dimensional shapes that, by eye, often bear striking similarity to one another over their entire lengths or over shorter regions. In parallel to what can be deduced from pure sequence relationships, structural similarities also suggest the possibility of evolutionary relationships between proteins. Indeed, because it is widely accepted that structure is better conserved than sequence (at least given our current ability to detect sequence relationships), the identification of structural relationships between proteins can provide important structural and functional information not available from sequence analysis alone. However, detecting geometric relationships between proteins is a far more uncertain process than the identification of pure sequence relationships, as the latter can be clearly defined in statistical terms. In contrast, there is considerable ambiguity in how to describe a geometric relationship between two proteins, resulting in the large number of approaches to this problem described in the literature.

One effective but qualitative approach is based on manual pattern recognition. Richardson's [1] classical review of structural motifs in proteins was a striking example that has evolved over the years into manually curated structure classification schemes, as epitomized by the SCOP [2] and CATH [3] databases. Implicit in SCOP and CATH is a hierarchical view whereby ‘structure space’ is divided into isolated, non-overlapping ‘islands’ that are denoted by categories such as folds. It is perhaps surprising that the concept of a fold has entered the vocabulary of structural biology in the complete absence of a clear quantitative measure of how such an entity should be described. Implicit in the hierarchical view is that protein structure space is discrete, in the sense that if a particular protein belongs to one category it does not belong to some other category.

Does the use of inherently rigid classification schemes limit our recognition of important relationships that exist between proteins that have been segregated into different categories? In principle, one could consider overlapping classifications, whereby each object is assigned to multiple classes; unfortunately, there are no overlapping classifications of protein structure space. Indeed, there is growing evidence that protein structure space is continuous, in the sense that there are meaningful structural relationships between proteins that are classified very differently. In this review, we discuss these alternative perspectives, and argue that both hierarchical and continuous views have ranges of validity. We suggest that the development of computational tools and algorithms that recognize both descriptions of structure space can enhance our ability to predict protein structure and function.

Section snippets

Protein structure alignment

Structural alignment programs define scoring functions that measure the geometric similarity between proteins and use various algorithms to search for two substructures such that these functions are optimal. Most existing similarity measures can be classified into two main types depending on what they compare: the distances between corresponding pairs of atoms in the two structures (e.g. [4, 5, 6]); and the relative positions of the corresponding atoms of two proteins that have been
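As an illustration of the first type of measure mentioned above (comparing distances between corresponding pairs of atoms), the following is a minimal sketch of a distance RMSD (dRMSD). It is one common example of a distance-based score and is not claimed to be the scoring function of any program cited here. The sketch assumes the correspondence between atoms is already known, so it omits the alignment search itself; the function names and the coordinates are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def drmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Root-mean-square difference between the internal distance matrices
    of two structures whose atoms are already paired by index."""
    if len(a) != len(b):
        raise ValueError("structures must have the same number of paired atoms")
    if len(a) < 2:
        raise ValueError("at least two atoms are needed to form a distance")
    total = 0.0
    n_pairs = 0
    for i, j in combinations(range(len(a)), 2):
        # Compare the i-j distance in structure a with the i-j distance in b.
        diff = math.dist(a[i], a[j]) - math.dist(b[i], b[j])
        total += diff * diff
        n_pairs += 1
    return math.sqrt(total / n_pairs)


def rotate_z_and_shift(coords: Sequence[Point], angle: float,
                       shift: Point) -> List[Point]:
    """Rigidly move a structure: rotate about the z axis, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]


# Hypothetical coordinates (not taken from any real structure).
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 3.8)]
moved = rotate_z_and_shift(reference, math.pi / 3, (10.0, -5.0, 2.0))
deformed = reference[:-1] + [(0.0, 3.8, 5.8)]  # last atom displaced along z

same_shape_score = drmsd(reference, moved)      # rigid motion only
deformed_score = drmsd(reference, deformed)     # genuine shape change
```

Because only internal distances are compared, the rigidly moved copy scores essentially zero without any superposition step, whereas the deformed copy scores clearly above zero. Measures that compare atomic positions directly require the structures to be placed in a common frame first.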

The nature of fold space

SCOP [2] and CATH [3] describe fold space in very similar ways. In SCOP's manual classification, the first two levels, ‘class’ and ‘fold’, are defined based purely on structure; the next level, ‘superfamily’, takes into account both structure and function, and the level below accounts for sequence as well, thus grouping proteins with clear evolutionary relationships. CATH combines manual classification with the automatic structural alignment program SSAP [6]: the topmost level, ‘class’, is
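The discrete, hierarchical view described above can be sketched as a simple data structure. The example below is a hedged illustration, not the schema of SCOP or CATH: the record type, the function name and all labels are hypothetical placeholders, and the name 'family' for the lowest level follows SCOP's own terminology rather than the text of this snippet.

```python
from typing import NamedTuple, Optional

# Levels ordered from most general to most specific.
LEVELS = ("class", "fold", "superfamily", "family")


class Classification(NamedTuple):
    """One domain's position in a SCOP-like hierarchy (hypothetical record)."""
    klass: str          # 'class' is a reserved word in Python
    fold: str
    superfamily: str
    family: str


def deepest_shared_level(a: Classification,
                         b: Classification) -> Optional[str]:
    """Return the most specific level at which two domains share a category,
    or None if they already differ at the top level."""
    shared = None
    for level, label_a, label_b in zip(LEVELS, a, b):
        if label_a != label_b:
            break  # in a strict hierarchy, nothing below a mismatch is shared
        shared = level
    return shared


# Placeholder labels; these are not real database assignments.
dom_a = Classification("alpha", "fold-1", "sf-1", "fam-1")
dom_b = Classification("alpha", "fold-1", "sf-1", "fam-2")
dom_c = Classification("alpha", "fold-2", "sf-3", "fam-4")

ab = deepest_shared_level(dom_a, dom_b)
ac = deepest_shared_level(dom_a, dom_c)
```

In this representation each domain belongs to exactly one category per level, so `dom_a` and `dom_c` are related only through their class. Any geometric similarity between them that crosses the fold boundary has no place in the record, which is the limitation the continuous view of fold space is meant to address.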

Does the description of fold space matter? Applications

The discrete and the continuous views of fold space have different advantages. The hierarchical classifications of proteins into evolutionarily related sequence families and superfamilies can be carried out in a relatively unambiguous fashion, and have the advantage that they are annotated and validated by experts in the field. Also, the sequence neighbors of every protein are well defined. The organization of this information into well-maintained databases is clearly extremely valuable. The

Conclusions

The increasing number of protein structures in the PDB and the availability of many fast programs that compare protein structures reveal many unsuspected similarities in protein structure space. Traditional discrete hierarchical classification schemes group proteins with clear evolutionary relationships. At the structural level, these classifications constitute an abstraction that groups structures into topologies and folds based on similarities that have been detected based, in part, on visual

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:

  • • of special interest

  • •• of outstanding interest

Acknowledgements

We are grateful to Michael Levitt, Chris Tang, Mickey Kosloff and Burkhard Rost for many helpful discussions on the topics covered in this review. This work was supported in part by the Northeast Structural Genomics Consortium (NESG – GM074958). The thinking reflected in this review has evolved in part as a result of facing the challenges of NESG target selection.

References (43)

  • Y. Zhang et al. TASSER: an automated method for the prediction of protein tertiary structures in CASP6. Proteins (2005)
  • Y. Zhang et al. The protein structure prediction problem could be solved using the current PDB library. Proc Natl Acad Sci USA (2005)
  • A. Andreeva et al. SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res (2004)
  • F. Pearl et al. The CATH domain structure database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Res (2005)
  • I.N. Shindyalov et al. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng (1998)
  • E. Krissinel et al. Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr D Biol Crystallogr (2004)
  • G.J. Kleywegt. Use of non-crystallographic symmetry in protein structure refinement. Acta Crystallogr D Biol Crystallogr (1996)
  • T. Madej et al. Threading a database of protein cores. Proteins (1995)
  • I. Eidhammer et al. Structure comparison and structure patterns. J Comput Biol (2000)
  • R. Kolodny et al. Approximate protein structural alignment in polynomial time. Proc Natl Acad Sci USA (2004)
  • P. Rogen et al. Automatic classification of protein structure by using Gauss integrals. Proc Natl Acad Sci USA (2003)