Tandem repeats in proteins: From sequence to structure
Introduction
The dramatic growth of genomic data presents new challenges for scientists: making sense of millions of protein sequences requires systematic approaches and information about their 3D structures as well as about their evolutionary and functional relationships. The majority of protein sequences are aperiodic and usually adopt globular 3D structures that carry out a variety of functions. Research efforts have focused primarily on these proteins and, as a result, significant progress has been made in the development of bioinformatics tools for their analysis. However, proteins also contain a large portion of periodic sequences representing arrays of repeats that are directly adjacent to each other (Heringa, 1998). These tandem repeats are considerably diverse, ranging from the repetition of a single amino acid to domains of 100 or more residues. They are ubiquitous in genomes and occur in at least 14% of all proteins (Marcotte et al., 1999). Furthermore, they are present in almost every third human protein and even in every second protein from Plasmodium falciparum or Dictyostelium discoideum (Pellegrini et al., 1999, Jorda and Kajava, 2010). The tandem repeat regions are highly polymorphic, mutating far more often than the background rate of point mutations (Buard and Vergnaud, 1994, Ellegren, 2000). The two main mechanisms underlying this hypermutability are: (i) replication slippage within DNA microsatellite regions (repeat units shorter than 10 nucleotides) and (ii) recombination events for longer minisatellite and satellite regions.
Over the last decade, numerous studies have demonstrated the fundamental functional importance of such tandem repeats and their involvement in human diseases. Considerable evidence has also accumulated for the high incidence of tandem repeats in the sequences of virulence factors of pathogenic agents, toxins and allergens (Kajava et al., 2006). Genetic instability of these regions can allow a rapid response to environmental changes and thus can lead to emerging infection threats. Furthermore, tandem repeats frequently occur in amyloidogenic and other disease-related sequences (Baxa et al., 2006, Nelson and Eisenberg, 2006). This implies that this class of sequences may have a broader role in human diseases than was previously recognized.
Thus, tandem repeat regions are abundant in proteomes and are related to major health threats of modern society. Accordingly, the discovery of these domains and the understanding of their sequence–structure–function relationships and of the mechanisms of their evolution promise to be a fertile direction for research leading to the identification of targets for new medicines and vaccines. However, the conventional bioinformatics approaches for annotation of proteomes, which were developed for globular domains, have limited success when applied to tandem repeat regions. These regions require dedicated computer programs and databases. Here, I survey the available bioinformatics tools for analysis of protein repeats, with emphasis on their sequences, 3D structures and sequence–structure relationships, and highlight successful strategies for the prediction of protein structure.
Identification of tandem repeats in protein sequences
The growth of proteomic data has led to increasing efforts to develop methods for protein repeat recognition. Protein tandem repeats are frequently imperfect: they contain a number of mutations (substitutions, insertions, deletions) accumulated during evolution, and some of them cannot be easily identified (Fig. 1). To solve this problem, several improved algorithms and software tools have been developed over the last few years. They can be subdivided into five general types of methods (Table 1).
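To make the notion of an imperfect tandem repeat concrete, the following Python sketch scans a sequence for arrays of adjacent units of a fixed period that tolerate substitutions. It is a deliberately naive illustration and not one of the published methods summarized in Table 1: the function name find_tandem_repeats, the TandemRepeat record and all thresholds are assumptions made for this example, and insertions and deletions are not handled at all.

```python
from typing import List, NamedTuple


class TandemRepeat(NamedTuple):
    start: int       # 0-based index of the first residue of the array
    period: int      # length of one repeat unit
    copies: int      # number of adjacent whole units
    identity: float  # mean identity between consecutive units


def find_tandem_repeats(seq: str, period: int, min_copies: int = 3,
                        min_identity: float = 0.7) -> List[TandemRepeat]:
    """Naive scan for substitution-tolerant tandem repeats of one period.

    Hypothetical helper for illustration only; indels are ignored.
    """
    candidates = []
    for start in range(len(seq) - period * min_copies + 1):
        copies, matches = 1, 0
        pos = start + period
        # Extend the array unit by unit while each unit resembles the previous one.
        while pos + period <= len(seq):
            prev, cur = seq[pos - period:pos], seq[pos:pos + period]
            m = sum(a == b for a, b in zip(prev, cur))
            if m / period < min_identity:
                break
            matches += m
            copies += 1
            pos += period
        if copies >= min_copies:
            identity = matches / (period * (copies - 1))
            candidates.append(TandemRepeat(start, period, copies, identity))

    # Prefer arrays with more copies, then higher identity; drop overlapping ones.
    candidates.sort(key=lambda r: (-r.copies, -r.identity, r.start))
    chosen: List[TandemRepeat] = []
    for r in candidates:
        end = r.start + r.period * r.copies
        if all(end <= c.start or r.start >= c.start + c.period * c.copies
               for c in chosen):
            chosen.append(r)
    return sorted(chosen, key=lambda r: r.start)


if __name__ == "__main__":
    # Made-up sequence: four imperfect 3-residue units flanked by unrelated residues.
    for hit in find_tandem_repeats("MKTGPPGPAGPPGEPLWK", period=3, min_identity=0.6):
        print(hit)
```

In practice the period is not known in advance, so a caller would loop over a range of periods and reconcile the hits. Dedicated tools also score substitutions with amino acid similarity matrices and allow gaps, which this sketch does not attempt.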
Databases and bioinformatics tools for analysis of protein tandem repeats
With the improvement of methods for identification of protein tandem repeats and the growing number of sequenced genomes, we face the next significant problem: how to make sense of this huge amount of data. This requires systematic large-scale analysis that can provide insight into the sequence motifs, inter-strain variability, structures, functions and evolution of tandem repeats. Popular integrated and well-annotated databases such as UniProt, Pfam, SMART, InterPro can be very
Databases devoted to the 3D structures of proteins with repeats
Alongside the increase in data on repeat sequences, recent years have been marked by the emergence of new 3D structures of these proteins, thanks to improved expression and crystallization strategies. Databases focused on protein structure classifications (PDB, SCOP, CATH and others) do not provide an easy way to select these structures from the PDB. As a result, databases specially dedicated to the 3D structures of proteins with tandem repeats have been developed. For
Updated classification of the 3D structures of proteins with repeats
The increasing number of known structures containing repetitive structural elements necessitates their classification to facilitate further understanding of their sequence–structure–function relationships as well as their evolutionary mechanisms. Ten years ago, a simple classification of these 3D structures based on the repeat length was suggested (Kajava, 2001). The classification has withstood the test of time, although today the appearance of new 3D structures requires and allows its
Structural prediction of proteins with tandem repeats
Proteins with tandem repeats are still under-represented in the PDB in view of the large number of these proteins identified in proteomes. One of the reasons is that the tandem repeat regions are frequently intrinsically unstructured (or disordered) (Marcotte et al., 1999, Kajava, 2001, Tompa, 2003). The genetic instability of tandem repeats, together with the structurally permissive nature of their disordered state, may increase the probability of newly emerged repeats being fixed during
Perspective
To address the challenges related to the exponential increase of data on protein tandem repeats, a number of new bioinformatics tools and databases have been developed including highly sensitive computer programs for identification of repeats in amino acid sequences, databases and tools for their comparative large-scale analysis, computer programs for detection of 3D repetitive units within the structures and tools to deepen our understanding of sequence–structure relationships. It was also
Acknowledgment
I thank Abdullah Ahmed for critical reading of the manuscript and suggestions.
References (107)
- et al. The contribution of entropy, enthalpy, and hydrophobic desolvation to cooperativity in repeat-protein folding. Structure (2011)
- et al. Triose-phosphate isomerase (TIM) of the psychrophilic bacterium Vibrio marinus. Kinetic and structural properties. J. Biol. Chem. (1998)
- et al. Homology-based method for identification of protein repeats using statistical significance estimates. J. Mol. Biol. (2000)
- et al. Structure, function, and amyloidogenesis of fungal prions: filament polymorphism and prion variants. Adv. Prot. Chem. (2006)
- et al. Drosophila kelch motif is derived from a common enzyme fold. J. Mol. Biol. (1994)
- et al. Helianthus tuberosus lectin reveals a widespread scaffold for mannose-binding lectins. Structure (1999)
- et al. A flexible motif search technique based on generalized profiles. Comput. Chem. (1996)
- et al. Crystal structure of a rigid four-spectrin-repeat fragment of the human desmoplakin plakin domain. J. Mol. Biol. (2011)
- et al. New folds for all-beta proteins. Structure (1993)
- Microsatellite mutations in the germline: implications for evolutionary inference. Trends Genet. (2000)
- Topological characteristics of helical repeat proteins. Curr. Opin. Struct. Biol.
- The extracellular architecture of adherens junctions revealed by crystal structures of type I cadherins. Structure
- Detection of internal repeats: how common are they? Curr. Opin. Struct. Biol.
- Structural basis for selective recognition of pneumococcal cell wall by modular endolysin from phage Cp-1. Structure
- Protein homorepeats sequences, structures, evolution, and functions. Adv. Prot. Chem. Struct. Biol.
- Structural diversity of leucine-rich repeat proteins. J. Mol. Biol.
- Review: proteins with repeated sequence–structural prediction and modeling. J. Struct. Biol.
- New HEAT-like repeat motifs in proteins regulating proteasome structure and function. J. Struct. Biol.
- Beta-structures in fibrous proteins. Adv. Prot. Chem.
- Beta-rolls, beta-helices, and other beta-solenoid proteins. Adv. Prot. Chem.
- The leucine-rich repeat: a versatile binding motif. Trends Biochem. Sci.
- When protein folding is simplified to protein coiling: the continuum of solenoid protein structures. Trends Biochem. Sci.
- The leucine-rich repeat as a protein recognition motif. Curr. Opin. Struct. Biol.
- Independent movement, dimerization and stability of tandem repeats of chicken brain alpha-spectrin. J. Mol. Biol.
- A recurring theme in protein engineering: the design, stability and folding of repeat proteins. Curr. Opin. Struct. Biol.
- High-resolution structure of a self-assembly-competent form of a hydrophobic peptide captured in a soluble beta-sheet scaffold. J. Mol. Biol.
- A census of protein repeats. J. Mol. Biol.
- The 14-fold periodicity in alpha-tropomyosin and the interaction with actin. J. Mol. Biol.
- Natural triple beta-stranded fibrous folds. Adv. Prot. Chem.
- A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr. Opin. Struct. Biol.
- Beta-Trefoil fold. Patterns of structure and sequence in the Kunitz inhibitors interleukins-1 beta and 1 alpha and fibroblast growth factors. J. Mol. Biol.
- Structural models of amyloid-like fibrils. Adv. Prot. Chem.
- Structural analysis of the anaphase-promoting complex reveals multiple active sites and insights into polyubiquitylation. Mol. Cell.
- Glutamine repeats and neurodegenerative diseases: molecular aspects. Trends Biochem. Sci.
- Identification of distant homologues of fibroblast growth factors suggests a common ancestor for all beta-trefoil proteins. J. Mol. Biol.
- ProSTRIP: a method to find similar structural repeats in three-dimensional protein structures. Comput. Biol. Chem.
- Structure of a Blm10 complex reveals common mechanisms for proteasome binding and gate opening. Mol. Cell.
- The Structure of calnexin, an ER chaperone involved in quality control of protein folding. Mol. Cell.
- Novel structure of the conserved gram-negative lipopolysaccharide transport protein A and mutagenesis analysis. J. Mol. Biol.
- A helical string of alternately connected three-helix bundles for the cell wall-associated adhesion protein Ebh from Staphylococcus aureus. Structure
- Swelfe: a detector of internal repeats in sequences and structures. Bioinformatics
- The conformational stability of alpha-helical nonpolar polypeptides in solution. Biochemistry
- Structure of the bacteriophage T4 long tail fiber receptor-binding tip. Proc. Natl. Acad. Sci. U.S.A.
- Structure and distribution of pentapeptide repeats in bacteria. Protein Sci.
- Three-dimensional structure of the alkaline protease of Pseudomonas aeruginosa: a two-domain protein with a calcium binding parallel beta roll motif. EMBO J.
- Crystal and molecular structure of a collagen-like peptide at 1.9 Å resolution. Science
- De novo identification of highly diverged protein repeats by probabilistic consistency. Bioinformatics
- High-affinity binders selected from designed ankyrin repeat protein libraries. Nat. Biotechnol.
- Structure of the 26S proteasome from Schizosaccharomyces pombe at subnanometer resolution. Proc. Natl. Acad. Sci. U.S.A.
- Complex recombination events at the hypermutable minisatellite CEB1 (D2S90). EMBO J.