Whole genome protein domain analysis using a new method for domain clustering
Introduction
The multi-domain combinatorial nature of many proteins makes it a challenge to comprehend protein diversity and to interpret the large sets of new sequences stemming from systematic genome sequencing. Such an interpretation is crucial for a useful annotation of genomic sequence data, and must rely on (1) a consistent classification of domain families, (2) efficient tools to detect both known and novel domains in every sequenced protein, and (3) a powerful human interface to access and comprehend the results of this automated analysis. Efforts towards a consistent classification of protein domains are underway with the construction of databases such as ProDom (Sonnhammer and Kahn, 1994; Corpet et al., 1999), PFAM (Sonnhammer et al., 1997) or DOMO (Gracy and Argos, 1998). Useful graphical interfaces are provided, for instance on the World Wide Web or with the XDOM program (Gouzy et al., 1997). Automatic domain detection algorithms have been developed, such as DOMAINER (Sonnhammer and Kahn, 1994), MKDOM (Gouzy et al., 1997) or DIVCLUS (Park and Teichmann, 1998). Here we use a new, more efficient domain clustering algorithm to systematically analyse domain families in the available fully sequenced genomes.
MKDOM version 2
Recently Altschul et al. (1997) published the PSI-BLAST program, which elegantly recruits sets of homologous proteins using an iteratively refined position-specific score matrix. PSI-BLAST cannot be used directly to build domain families in a fully automatic way, because multi-domain query sequences yield heterogeneous sequence sets. However, if sequence databases are queried with a single domain, then PSI-BLAST should allow the rapid generation of the corresponding homogeneous domain family. We
Whole genome domain analysis
To systematically analyse domain families and domain arrangements in various bacterial, archaeal and eukaryotic organisms, we applied the algorithm outlined above to a set of 38,440 sequences encompassing all known or predicted protein sequences encoded by 17 whole genomes. These include 12 bacteria (Aquifex aeolicus, Bacillus subtilis, Borrelia burgdorferi, Chlamydia trachomatis, Escherichia coli, Haemophilus influenzae, Helicobacter pylori, Mycobacterium tuberculosis, Mycoplasma
Statistics of multi-domain proteins in various genomes
The analysis above allowed us to calculate the distribution of multi-domain proteins for each of the 17 available complete genomes (Fig. 2). In general, the number of proteins containing k or more domains decreases exponentially with k. There are, however, a few exceptions, the most notable being B. subtilis and M. tuberculosis, which show biphasic distributions with abnormally large sets of highly multi-domain proteins. Extreme multi-domain proteins in B. subtilis include polyketide synthase
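The exponential decrease described above can be checked with a short calculation. The following is a minimal sketch, not the authors' procedure: it assumes that each protein has already been reduced to its number of domains, counts the proteins with k or more domains, and fits a straight line to the logarithm of those counts. The function names and the toy data are hypothetical and were chosen only for illustration.

```python
import math

def at_least_k_counts(domain_counts):
    """Return {k: number of proteins with k or more domains} for k = 1..max."""
    max_k = max(domain_counts)
    return {k: sum(1 for n in domain_counts if n >= k)
            for k in range(1, max_k + 1)}

def log_linear_fit(cumulative):
    """Least-squares fit of ln N(k) = a + b*k; returns (a, b).

    A purely exponential decrease gives a straight line with slope b < 0.
    Requires counts for at least two distinct values of k.
    """
    ks = sorted(cumulative)
    ys = [math.log(cumulative[k]) for k in ks]
    n = len(ks)
    mean_k = sum(ks) / n
    mean_y = sum(ys) / n
    sxx = sum((k - mean_k) ** 2 for k in ks)
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in zip(ks, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_k
    return intercept, slope

# Hypothetical toy data (not taken from the paper): domains per protein,
# constructed so that N(k) halves each time k increases by one.
toy_counts = [1] * 80 + [2] * 40 + [3] * 20 + [4] * 10 + [5] * 10
cumulative = at_least_k_counts(toy_counts)
intercept, slope = log_linear_fit(cumulative)
```

With real data, a biphasic distribution of the kind reported for some genomes would be expected to appear as a systematic excess of observed counts over the fitted line at large k, although the threshold for calling a deviation significant is left open here.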
Domain shuffling
The analysis above allowed us to identify which domains are highly shuffled on the N-terminal side, on the C-terminal side or on both sides. Among the most shuffled domains we find:
- the TPR domain, which neighbours 144 and 124 different domain types on the N-terminal and C-terminal sides, respectively; it is frequently tandemly repeated;
- the ‘two-component’ receiver domain;
- ferredoxin domains.
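The neighbour counts quoted for the TPR domain can be illustrated with a small example. This is a sketch under stated assumptions rather than the method used in the paper: each protein is assumed to be given as a list of domain family names ordered from the N-terminus to the C-terminus, and a tandem repeat of the same domain is assumed not to count as a different neighbouring type. The function name and the domain labels other than TPR are hypothetical.

```python
from collections import defaultdict

def neighbour_diversity(architectures):
    """Count distinct neighbouring domain types on each side of every domain.

    architectures: list of proteins, each a list of domain names ordered
    from N-terminus to C-terminus.
    Returns {domain: (n_terminal_neighbour_types, c_terminal_neighbour_types)}.
    """
    n_side = defaultdict(set)  # domain types seen immediately before a domain
    c_side = defaultdict(set)  # domain types seen immediately after a domain
    for domains in architectures:
        for left, right in zip(domains, domains[1:]):
            if left == right:
                # Assumption: tandem repeats are not counted as neighbours.
                continue
            c_side[left].add(right)
            n_side[right].add(left)
    all_domains = {d for domains in architectures for d in domains}
    return {d: (len(n_side[d]), len(c_side[d])) for d in all_domains}

# Hypothetical toy architectures (labels A-E are placeholders, not real families).
toy_architectures = [
    ["A", "TPR", "TPR", "B"],
    ["C", "TPR", "D"],
    ["TPR", "B"],
    ["E"],
]
diversity = neighbour_diversity(toy_architectures)
```

In this toy set, `diversity["TPR"]` is `(2, 2)`: the domain is preceded by A or C and followed by B or D. Ranking domains by either number gives a simple measure of how shuffled a domain is on each side.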
Acknowledgements
We wish to thank Claude Chevalet for stimulating discussions. This work was supported in part by the Centre National de la Recherche Scientifique (Genome Initiative) and the European Union (Biotech BIO4-CT980052).
References (13)
- et al. Basic local alignment search tool. J. Mol. Biol. (1990)
- et al. A repeating amino acid motif in CDC23 defines a family of proteins and a new relationship among genes required for mitosis and RNA synthesis. Cell (1990)
- et al. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. (1996)
- Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997)
- Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature (1998)
- Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res. (1988)