Tandem repeats in proteins: From sequence to structure

doi:10.1016/j.jsb.2011.08.009

Journal of Structural Biology

Volume 179, Issue 3, September 2012, Pages 279-288

https://doi.org/10.1016/j.jsb.2011.08.009

Abstract

The bioinformatics analysis of proteins containing tandem repeats requires special computer programs and databases, since the conventional approaches, developed predominantly for globular domains, have limited success. Here, I survey bioinformatics tools that have been developed recently for the identification and proteome-wide analysis of protein repeats. The last few years have also been marked by the emergence of new 3D structures of these proteins. Appraisal of the known structures and their classification uncovers a straightforward relationship between their architecture and the length of the repetitive units. This relationship and the repetitive character of the structural folds suggest rules for better prediction of the 3D structures of such proteins. Furthermore, bioinformatics approaches combined with low-resolution structural data from biophysical techniques, especially the recently emerged cryo-electron microscopy, lead to reliable prediction of protein repeat structures and of their mode of binding to partners within molecular complexes. This hybrid approach can be used for structural and functional annotation of proteomes.

Introduction

The dramatic growth of genomic data presents new challenges for scientists: making sense of millions of protein sequences requires systematic approaches and information about their 3D structures as well as their evolutionary and functional relationships. The majority of protein sequences are aperiodic and usually have globular 3D structures that carry out a variety of functions. Research efforts have focused mainly on these proteins and, as a result, significant progress has been made in the development of bioinformatics tools for their analysis. However, proteins also contain a large portion of periodic sequences representing arrays of repeats that are directly adjacent to each other (Heringa, 1998). These tandem repeats are considerably diverse, ranging from the repetition of a single amino acid to domains of 100 or more residues. They are ubiquitous in genomes and occur in at least 14% of all proteins (Marcotte et al., 1999). Furthermore, they are present in almost every third human protein and even in every second protein from Plasmodium falciparum or Dictyostelium discoideum (Pellegrini et al., 1999, Jorda and Kajava, 2010). The tandem repeat regions are highly polymorphic compared to the background rate of point mutations (Buard and Vergnaud, 1994, Ellegren, 2000). The two main mechanisms underlying this hypermutability are: (i) replication slippage within DNA microsatellite regions (repeat units shorter than 10 nucleotides) and (ii) recombination events in longer minisatellite and satellite regions.
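The simplest end of this spectrum, the repetition of a single amino acid, can be located with a few lines of code. The following is a minimal sketch for illustration only: the function name `homorepeat_runs`, the example sequence and the minimum run length of 5 residues are assumptions made here, not definitions taken from the literature cited above.

```python
from itertools import groupby


def homorepeat_runs(seq, min_len=5):
    """Return (residue, start, length) for every run of a single residue
    that is at least min_len long. min_len=5 is an arbitrary illustrative
    threshold, not a value from the surveyed literature."""
    runs = []
    pos = 0  # 0-based index of the first residue of the current run
    for residue, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((residue, pos, length))
        pos += length
    return runs


# Hypothetical sequence with a poly-Q and a poly-N stretch.
example = "MST" + "Q" * 8 + "APL" + "N" * 6 + "GK"
print(homorepeat_runs(example))
```

Longer repetitive units, and units that have diverged by mutation, need more sensitive methods than this exact-match scan.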

Over the last decade, numerous studies have demonstrated the fundamental functional importance of such tandem repeats and their involvement in human diseases. Evidence has also accumulated for a high incidence of tandem repeats in the sequences of virulence factors of pathogenic agents, toxins and allergens (Kajava et al., 2006). Genetic instability of these regions can allow a rapid response to environmental changes and can thus lead to emerging infection threats. Furthermore, tandem repeats frequently occur in amyloidogenic and other disease-related sequences (Baxa et al., 2006, Nelson and Eisenberg, 2006). This implies that this class of sequences may have a broader role in human diseases than was previously recognized.

Thus, tandem repeat regions are abundant in proteomes and are related to major health threats of modern society. The discovery of these domains and the understanding of their sequence–structure–function relationships and of the mechanisms of their evolution therefore promise to be a fertile direction for research, leading to the identification of targets for new medicines and vaccines. However, the conventional bioinformatics approaches for proteome annotation, which were developed for globular domains, have limited success when applied to tandem repeat regions. These regions require special computer programs and databases. Here, I survey the available bioinformatics tools for the analysis of protein repeats, with emphasis on sequences, 3D structures and sequence–structure relationships, and highlight successful strategies for predicting the structures of these proteins.

Section snippets

Identification of tandem repeats in protein sequences

The growth of proteomic data has led to increasing efforts to develop methods for protein repeat recognition. Protein tandem repeats are frequently imperfect, containing a number of mutations (substitutions, insertions, deletions) accumulated during evolution, and some of them cannot be easily identified (Fig. 1). To solve this problem, several improved algorithms and programs have been developed over the last few years. They can be subdivided into five general types of methods (Table 1).
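To illustrate why imperfect repeats are harder to detect than exact ones, the following minimal sketch scores each candidate period by comparing a sequence with a shifted copy of itself. It is a toy self-comparison written for this illustration and is not one of the five types of methods listed in Table 1; the function name `best_period`, its parameters and the example sequence are all assumptions. It tolerates substitutions only, since an insertion or deletion shifts the register and breaks the comparison.

```python
def best_period(seq, min_period=2, max_period=None):
    """Return (period, score), where score is the fraction of positions i
    with seq[i] == seq[i + period], maximised over candidate periods.
    Substitutions lower the score; insertions and deletions are NOT handled."""
    n = len(seq)
    if max_period is None:
        max_period = n // 2  # require at least two full copies of the unit
    scores = {}
    for p in range(min_period, max_period + 1):
        pairs = n - p  # number of (i, i + p) position pairs to compare
        matches = sum(1 for i in range(pairs) if seq[i] == seq[i + p])
        scores[p] = matches / pairs
    if not scores:
        return None, 0.0  # sequence too short for any candidate period
    best = max(scores, key=scores.get)  # ties resolve to the smallest period
    return best, scores[best]


# Hypothetical 7-residue unit repeated three times with two substitutions.
seq = "GSAPTKE" + "GSVPTKE" + "GSAPSKE"
period, score = best_period(seq)
print(period, round(score, 3))
```

In this example the three substituted positions reduce the score at the true period without hiding it. In longer arrays a multiple of the true period can score higher than the period itself, and indels defeat the comparison entirely, which is one reason dedicated alignment-based or profile-based programs are needed.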

Databases and bioinformatics tools for analysis of protein tandem repeats

With the improvement of methods for the identification of protein tandem repeats and the growing number of sequenced genomes, we face the next significant problem: how to make sense of this huge amount of data? This requires systematic large-scale analysis, which can provide insight into the sequence motifs, inter-strain variability, structures, functions and evolution of tandem repeats. Popular integrated and well-annotated databases such as UniProt, Pfam, SMART and InterPro can be very

Databases devoted to the 3D structures of proteins with repeats

Alongside the increase in data on repeat sequences, recent years have been marked by the emergence of new 3D structures of these proteins, thanks to improved expression and crystallization strategies. Databases focused on protein structure classification (PDB, SCOP, CATH and others) do not provide an easy way to select these structures from the PDB. As a result, databases specially dedicated to the 3D structures of proteins with tandem repeats have been developed. For

Updated classification of the 3D structures of proteins with repeats

The increasing number of known structures containing repetitive structural elements necessitates their classification, to facilitate further understanding of their sequence–structure–function relationships as well as their evolutionary mechanisms. Ten years ago, a simple classification of these 3D structures based on the repeat length was suggested (Kajava, 2001). The classification has withstood the test of time, although the appearance of new 3D structures now both requires and allows its

Structural prediction of proteins with tandem repeats

Proteins with tandem repeats are still under-represented in the PDB in view of the large number of these proteins identified in proteomes. One of the reasons is that the tandem repeat regions are frequently intrinsically unstructured (or disordered) (Marcotte et al., 1999, Kajava, 2001, Tompa, 2003). The genetic instability of tandem repeats, together with the structurally permissive nature of their disordered state, may increase the probability of newly emerged repeats being fixed during

Perspective

To address the challenges related to the exponential increase of data on protein tandem repeats, a number of new bioinformatics tools and databases have been developed including highly sensitive computer programs for identification of repeats in amino acid sequences, databases and tools for their comparative large-scale analysis, computer programs for detection of 3D repetitive units within the structures and tools to deepen our understanding of sequence–structure relationships. It was also

Acknowledgment

I thank Abdullah Ahmed for critical reading of the manuscript and suggestions.

References (107)

T. Aksel et al. The contribution of entropy, enthalpy, and hydrophobic desolvation to cooperativity in repeat-protein folding. Structure (2011)

M. Alvarez et al. Triose-phosphate isomerase (TIM) of the psychrophilic bacterium Vibrio marinus. Kinetic and structural properties. J. Biol. Chem. (1998)

M.A. Andrade et al. Homology-based method for identification of protein repeats using statistical significance estimates. J. Mol. Biol. (2000)

U. Baxa et al. Structure, function, and amyloidogenesis of fungal prions: filament polymorphism and prion variants. Adv. Prot. Chem. (2006)

P. Bork et al. Drosophila kelch motif is derived from a common enzyme fold. J. Mol. Biol. (1994)

Y. Bourne et al. Helianthus tuberosus lectin reveals a widespread scaffold for mannose-binding lectins. Structure (1999)

P. Bucher et al. A flexible motif search technique based on generalized profiles. Comput. Chem. (1996)

H.J. Choi et al. Crystal structure of a rigid four-spectrin-repeat fragment of the human desmoplakin plakin domain. J. Mol. Biol. (2011)

C. Chothia et al. New folds for all-beta proteins. Structure (1993)

H. Ellegren. Microsatellite mutations in the germline: implications for evolutionary inference. Trends Genet. (2000)

M.R. Groves et al. Topological characteristics of helical repeat proteins. Curr. Opin. Struct. Biol. (1999)

O.J. Harrison et al. The extracellular architecture of adherens junctions revealed by crystal structures of type I cadherins. Structure (2011)

J. Heringa. Detection of internal repeats: how common are they? Curr. Opin. Struct. Biol. (1998)

J.A. Hermoso et al. Structural basis for selective recognition of pneumococcal cell wall by modular endolysin from phage Cp-1. Structure (2003)

J. Jorda et al. Protein homorepeats sequences, structures, evolution, and functions. Adv. Prot. Chem. Struct. Biol. (2010)

A.V. Kajava. Structural diversity of leucine-rich repeat proteins. J. Mol. Biol. (1998)

A.V. Kajava. Review: proteins with repeated sequence–structural prediction and modeling. J. Struct. Biol. (2001)

A.V. Kajava et al. New HEAT-like repeat motifs in proteins regulating proteasome structure and function. J. Struct. Biol. (2004)

A.V. Kajava et al. Beta-structures in fibrous proteins. Adv. Prot. Chem. (2006)

A.V. Kajava et al. Beta-rolls, beta-helices, and other beta-solenoid proteins. Adv. Prot. Chem. (2006)

B. Kobe et al. The leucine-rich repeat: a versatile binding motif. Trends Biochem. Sci. (1994)

B. Kobe et al. When protein folding is simplified to protein coiling: the continuum of solenoid protein structures. Trends Biochem. Sci. (2000)

B. Kobe et al. The leucine-rich repeat as a protein recognition motif. Curr. Opin. Struct. Biol. (2001)

H. Kusunoki et al. Independent movement, dimerization and stability of tandem repeats of chicken brain alpha-spectrin. J. Mol. Biol. (2004)

E.R. Main et al. A recurring theme in protein engineering: the design, stability and folding of repeat proteins. Curr. Opin. Struct. Biol. (2005)

K. Makabe et al. High-resolution structure of a self-assembly-competent form of a hydrophobic peptide captured in a soluble beta-sheet scaffold. J. Mol. Biol. (2008)

E.M. Marcotte et al. A census of protein repeats. J. Mol. Biol. (1999)

A.D. McLachlan et al. The 14-fold periodicity in alpha-tropomyosin and the interaction with actin. J. Mol. Biol. (1976)

A. Mitraki et al. Natural triple beta-stranded fibrous folds. Adv. Prot. Chem. (2006)

J. Moult. A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr. Opin. Struct. Biol. (2005)

A.G. Murzin et al. Beta-Trefoil fold. Patterns of structure and sequence in the Kunitz inhibitors interleukins-1 beta and 1 alpha and fibroblast growth factors. J. Mol. Biol. (1992)

R. Nelson et al. Structural models of amyloid-like fibrils. Adv. Prot. Chem. (2006)

L.A. Passmore et al. Structural analysis of the anaphase-promoting complex reveals multiple active sites and insights into polyubiquitylation. Mol. Cell (2005)

M.F. Perutz. Glutamine repeats and neurodegenerative diseases: molecular aspects. Trends Biochem. Sci. (1999)

C.P. Ponting et al. Identification of distant homologues of fibroblast growth factors suggests a common ancestor for all beta-trefoil proteins. J. Mol. Biol. (2000)

R. Sabarinathan et al. ProSTRIP: a method to find similar structural repeats in three-dimensional protein structures. Comput. Biol. Chem. (2010)

K. Sadre-Bazzaz et al. Structure of a Blm10 complex reveals common mechanisms for proteasome binding and gate opening. Mol. Cell (2010)

J.D. Schrag et al. The Structure of calnexin, an ER chaperone involved in quality control of protein folding. Mol. Cell (2001)

M.D. Suits et al. Novel structure of the conserved gram-negative lipopolysaccharide transport protein A and mutagenesis analysis. J. Mol. Biol. (2008)

Y. Tanaka et al. A helical string of alternately connected three-helix bundles for the cell wall-associated adhesion protein Ebh from Staphylococcus aureus. Structure (2008)

A.L. Abraham et al. Swelfe: a detector of internal repeats in sequences and structures. Bioinformatics (2008)

H.E. Auer et al. The conformational stability of alpha-helical nonpolar polypeptides in solution. Biochemistry (1966)

S.G. Bartual et al. Structure of the bacteriophage T4 long tail fiber receptor-binding tip. Proc. Natl. Acad. Sci. USA (2010)

A. Bateman et al. Structure and distribution of pentapeptide repeats in bacteria. Protein Sci. (1998)

U. Baumann et al. Three-dimensional structure of the alkaline protease of Pseudomonas aeruginosa: a two-domain protein with a calcium binding parallel beta roll motif. EMBO J. (1993)

J. Bella et al. Crystal and molecular structure of a collagen-like peptide at 1.9 Å resolution. Science (1994)

A. Biegert et al. De novo identification of highly diverged protein repeats by probabilistic consistency. Bioinformatics (2008)

H.K. Binz et al. High-affinity binders selected from designed ankyrin repeat protein libraries. Nat. Biotechnol. (2004)

S. Bohn et al. Structure of the 26S proteasome from Schizosaccharomyces pombe at subnanometer resolution. Proc. Natl. Acad. Sci. USA (2010)

J. Buard et al. Complex recombination events at the hypermutable minisatellite CEB1 (D2S90). EMBO J. (1994)

