Tandem repeats in proteins: From sequence to structure

https://doi.org/10.1016/j.jsb.2011.08.009

Abstract

The bioinformatics analysis of proteins containing tandem repeats requires special computer programs and databases, since the conventional approaches, developed predominantly for globular domains, have limited success. Here, I survey bioinformatics tools that have been developed recently for the identification and proteome-wide analysis of protein repeats. The last few years have also been marked by the emergence of new 3D structures of these proteins. Appraisal of the known structures and their classification uncovers a straightforward relationship between their architecture and the length of the repetitive units. This relationship and the repetitive character of structural folds suggest rules for better prediction of the 3D structures of such proteins. Furthermore, bioinformatics approaches combined with low-resolution structural data from biophysical techniques, especially the recently emerged cryo-electron microscopy, lead to reliable prediction of protein repeat structures and their mode of binding with partners within molecular complexes. This hybrid approach can be used for structural and functional annotation of proteomes.

Introduction

The dramatic growth of genomic data presents new challenges for scientists: making sense of millions of protein sequences requires systematic approaches and information about their 3D structures as well as about their evolutionary and functional relationships. The majority of protein sequences are aperiodic and usually have globular 3D structures that carry out a variety of functions. Researchers have devoted most of their efforts to these types of proteins and, as a result, significant progress has been made in the development of bioinformatics tools for their analysis. However, proteins also contain a large portion of periodic sequences representing arrays of repeats that are directly adjacent to each other (Heringa, 1998). These tandem repeats are considerably diverse, ranging from the repetition of a single amino acid to domains of 100 or more residues. They are ubiquitous in genomes and occur in at least 14% of all proteins (Marcotte et al., 1999). Furthermore, they are present in almost one in three human proteins and in as many as one in two proteins from Plasmodium falciparum or Dictyostelium discoideum (Pellegrini et al., 1999, Jorda and Kajava, 2010). The tandem repeat regions are highly polymorphic compared to the background rate of point mutations (Buard and Vergnaud, 1994, Ellegren, 2000). The two main mechanisms underlying this hypermutability are: (i) replication slippage within DNA microsatellite regions (repeat units shorter than 10 nucleotides) and (ii) recombination events for longer minisatellite and satellite regions.

Over the last decade, numerous studies have demonstrated the fundamental functional importance of such tandem repeats and their involvement in human diseases. Considerable evidence has also been gathered for the high incidence of tandem repeats in the sequences of virulence factors of pathogenic agents, toxins and allergens (Kajava et al., 2006). Genetic instability of these regions can allow a rapid response to environmental changes and thus can lead to emerging infection threats. Furthermore, tandem repeats frequently occur in amyloidogenic and other disease-related sequences (Baxa et al., 2006, Nelson and Eisenberg, 2006). This implies that this class of sequences may have a broader role in human diseases than was previously recognized.

Thus, tandem repeat regions are abundant in proteomes and are related to major health threats of modern society. The discovery of these domains, together with an understanding of their sequence–structure–function relationships and the mechanisms of their evolution, therefore promises to be a fertile direction for research leading to the identification of targets for new medicines and vaccines. However, the conventional bioinformatics approaches for annotation of proteomes, which were developed for globular domains, have limited success when applied to tandem repeat regions. These regions require special computer programs and databases. Here, I survey available bioinformatics tools for the analysis of protein repeats, with emphasis on their sequences, 3D structures and sequence–structure relationships, and highlight successful strategies for the prediction of protein structure.

Section snippets

Identification of tandem repeats in protein sequences

The growth of proteomic data has led to increasing efforts to develop methods for protein repeat recognition. Protein tandem repeats are frequently imperfect: they contain mutations (substitutions, insertions, deletions) accumulated during evolution, and some of them cannot be easily identified (Fig. 1). To solve this problem, several improved algorithms and software tools have been developed over the last few years. They can be subdivided into five general types of methods (Table 1).
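To make the difficulty with imperfect repeats concrete, the following is a minimal Python sketch of one naive idea: compare a sequence with a copy of itself shifted by a candidate period and look for stretches where most residues match. It is an illustration only and is not one of the method types listed in Table 1; the function name `find_tandem_repeats` and the parameters `max_period`, `min_copies` and `min_identity` are hypothetical. The sketch tolerates substitutions but not insertions or deletions, which is one reason dedicated tools are needed.

```python
def find_tandem_repeats(seq, max_period=20, min_copies=3, min_identity=0.8):
    """Naive tandem repeat scan by self-comparison at a fixed shift.

    Returns a list of (start, period, region, identity) tuples.
    Hypothetical helper for illustration; handles substitutions only.
    """
    hits = []
    n = len(seq)
    for period in range(1, max_period + 1):
        i = 0
        while i + period < n:
            matches = 0   # positions j where seq[j] == seq[j + period]
            best_len = 0  # compared span, measured up to the last match
            for j in range(i, n - period):
                if seq[j] == seq[j + period]:
                    matches += 1
                    best_len = j - i + 1
                elif matches / (j - i + 1) < min_identity:
                    # Running identity fell below the threshold: stop extending.
                    break
            # A region with k copies yields at least (k - 1) * period compared positions.
            if best_len >= (min_copies - 1) * period:
                region = seq[i:i + best_len + period]
                hits.append((i, period, region, matches / best_len))
                i += best_len + period  # continue after the reported region
            else:
                i += 1
    return hits


if __name__ == "__main__":
    # Perfect repeat of the tripeptide GAP flanked by unrelated residues.
    print(find_tandem_repeats("MKTGAPGAPGAPGAPLLV"))
    # One substitution (A -> S) inside the array; a looser threshold is needed.
    print(find_tandem_repeats("GAPGAPGSPGAPGAP", min_identity=0.7))
```

Because the scan is run independently for every period, a long array may also be reported at multiples of its true period, so a practical tool would need to merge or rank overlapping hits.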

Databases and bioinformatics tools for analysis of protein tandem repeats

With the improvement of methods for the identification of protein tandem repeats and the growing number of sequenced genomes, we face the next significant problem: how to make sense of this huge amount of data. This requires systematic large-scale analysis that can provide insight into the sequence motifs, inter-strain variability, structures, functions and evolution of tandem repeats. Popular integrated and well-annotated databases such as UniProt, Pfam, SMART, InterPro can be very

Databases devoted to the 3D structures of proteins with repeats

Alongside the increase in data on repeat sequences, recent years have been marked by the emergence of new 3D structures of these proteins, thanks to improved expression and crystallization strategies. Databases focused on protein structure classification (PDB, SCOP, CATH and others) do not provide an easy way to select these structures from the PDB. As a result, databases specially dedicated to the 3D structures of proteins with tandem repeats have been developed. For

Updated classification of the 3D structures of proteins with repeats

The increasing number of known structures containing repetitive structural elements necessitates their classification to facilitate further understanding of their sequence–structure–function relationships as well as their evolutionary mechanisms. Ten years ago, a simple classification of these 3D structures based on the repeat length was suggested (Kajava, 2001). The classification withstood the test of time, albeit, today, the appearance of new 3D structures requires and allows its

Structural prediction of proteins with tandem repeats

Proteins with tandem repeats are still under-represented in the PDB in view of the large number of these proteins identified in proteomes. One of the reasons is that the tandem repeat regions are frequently intrinsically unstructured (or disordered) (Marcotte et al., 1999, Kajava, 2001, Tompa, 2003). The genetic instability of tandem repeats, together with the structurally permissive nature of their disordered state, may increase the probability of newly emerged repeats being fixed during

Perspective

To address the challenges related to the exponential increase of data on protein tandem repeats, a number of new bioinformatics tools and databases have been developed including highly sensitive computer programs for identification of repeats in amino acid sequences, databases and tools for their comparative large-scale analysis, computer programs for detection of 3D repetitive units within the structures and tools to deepen our understanding of sequence–structure relationships. It was also

Acknowledgment

I thank Abdullah Ahmed for critical reading of the manuscript and suggestions.

References (107)

  • M.R. Groves et al. Topological characteristics of helical repeat proteins. Curr. Opin. Struct. Biol. (1999)
  • O.J. Harrison et al. The extracellular architecture of adherens junctions revealed by crystal structures of type I cadherins. Structure (2011)
  • J. Heringa. Detection of internal repeats: how common are they? Curr. Opin. Struct. Biol. (1998)
  • J.A. Hermoso et al. Structural basis for selective recognition of pneumococcal cell wall by modular endolysin from phage Cp-1. Structure (2003)
  • J. Jorda et al. Protein homorepeats sequences, structures, evolution, and functions. Adv. Prot. Chem. Struct. Biol. (2010)
  • A.V. Kajava. Structural diversity of leucine-rich repeat proteins. J. Mol. Biol. (1998)
  • A.V. Kajava. Review: proteins with repeated sequence–structural prediction and modeling. J. Struct. Biol. (2001)
  • A.V. Kajava et al. New HEAT-like repeat motifs in proteins regulating proteasome structure and function. J. Struct. Biol. (2004)
  • A.V. Kajava et al. Beta-structures in fibrous proteins. Adv. Prot. Chem. (2006)
  • A.V. Kajava et al. Beta-rolls, beta-helices, and other beta-solenoid proteins. Adv. Prot. Chem. (2006)
  • B. Kobe et al. The leucine-rich repeat: a versatile binding motif. Trends Biochem. Sci. (1994)
  • B. Kobe et al. When protein folding is simplified to protein coiling: the continuum of solenoid protein structures. Trends Biochem. Sci. (2000)
  • B. Kobe et al. The leucine-rich repeat as a protein recognition motif. Curr. Opin. Struct. Biol. (2001)
  • H. Kusunoki et al. Independent movement, dimerization and stability of tandem repeats of chicken brain alpha-spectrin. J. Mol. Biol. (2004)
  • E.R. Main et al. A recurring theme in protein engineering: the design, stability and folding of repeat proteins. Curr. Opin. Struct. Biol. (2005)
  • K. Makabe et al. High-resolution structure of a self-assembly-competent form of a hydrophobic peptide captured in a soluble beta-sheet scaffold. J. Mol. Biol. (2008)
  • E.M. Marcotte et al. A census of protein repeats. J. Mol. Biol. (1999)
  • A.D. McLachlan et al. The 14-fold periodicity in alpha-tropomyosin and the interaction with actin. J. Mol. Biol. (1976)
  • A. Mitraki et al. Natural triple beta-stranded fibrous folds. Adv. Prot. Chem. (2006)
  • J. Moult. A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr. Opin. Struct. Biol. (2005)
  • A.G. Murzin et al. Beta-Trefoil fold. Patterns of structure and sequence in the Kunitz inhibitors interleukins-1 beta and 1 alpha and fibroblast growth factors. J. Mol. Biol. (1992)
  • R. Nelson et al. Structural models of amyloid-like fibrils. Adv. Prot. Chem. (2006)
  • L.A. Passmore et al. Structural analysis of the anaphase-promoting complex reveals multiple active sites and insights into polyubiquitylation. Mol. Cell. (2005)
  • M.F. Perutz. Glutamine repeats and neurodegenerative diseases: molecular aspects. Trends Biochem. Sci. (1999)
  • C.P. Ponting et al. Identification of distant homologues of fibroblast growth factors suggests a common ancestor for all beta-trefoil proteins. J. Mol. Biol. (2000)
  • R. Sabarinathan et al. ProSTRIP: a method to find similar structural repeats in three-dimensional protein structures. Comput. Biol. Chem. (2010)
  • K. Sadre-Bazzaz et al. Structure of a Blm10 complex reveals common mechanisms for proteasome binding and gate opening. Mol. Cell. (2010)
  • J.D. Schrag et al. The Structure of calnexin, an ER chaperone involved in quality control of protein folding. Mol. Cell. (2001)
  • M.D. Suits et al. Novel structure of the conserved gram-negative lipopolysaccharide transport protein A and mutagenesis analysis. J. Mol. Biol. (2008)
  • Y. Tanaka et al. A helical string of alternately connected three-helix bundles for the cell wall-associated adhesion protein Ebh from Staphylococcus aureus. Structure (2008)
  • A.L. Abraham et al. Swelfe: a detector of internal repeats in sequences and structures. Bioinformatics (2008)
  • H.E. Auer et al. The conformational stability of alpha-helical nonpolar polypeptides in solution. Biochemistry (1966)
  • S.G. Bartual et al. Structure of the bacteriophage T4 long tail fiber receptor-binding tip. Proc. Natl. Acad. Sci. USA (2010)
  • A. Bateman et al. Structure and distribution of pentapeptide repeats in bacteria. Protein Sci. (1998)
  • U. Baumann et al. Three-dimensional structure of the alkaline protease of Pseudomonas aeruginosa: a two-domain protein with a calcium binding parallel beta roll motif. EMBO J. (1993)
  • J. Bella et al. Crystal and molecular structure of a collagen-like peptide at 1.9 Å resolution. Science (1994)
  • A. Biegert et al. De novo identification of highly diverged protein repeats by probabilistic consistency. Bioinformatics (2008)
  • H.K. Binz et al. High-affinity binders selected from designed ankyrin repeat protein libraries. Nat. Biotechnol. (2004)
  • S. Bohn et al. Structure of the 26S proteasome from Schizosaccharomyces pombe at subnanometer resolution. Proc. Natl. Acad. Sci. USA (2010)
  • J. Buard et al. Complex recombination events at the hypermutable minisatellite CEB1 (D2S90). EMBO J. (1994)