Whole genome protein domain analysis using a new method for domain clustering

https://doi.org/10.1016/S0097-8485(99)00011-X

Abstract

We present the outcome of a systematic analysis of protein domain shuffling in 17 completed microbial genomes. This analysis has been performed using MKDOM Version 2, a completely new version of the domain clustering program MKDOM based on PSI-BLAST recursive homology searches. The analysis identifies the most frequent protein domain building blocks, the domains found specifically in Bacteria, Archaea or yeast, and the domains shared between two or all three domains of life. The latter are good candidates as the basic protein building blocks underlying all forms of cellular life. Statistics of multi-domain proteins indicate that some organisms such as Bacillus subtilis or Mycobacterium tuberculosis contain an abnormally high number of large multi-domain proteins. We also provide examples of highly shuffled or circularly permuted domains. A WWW graphical interface for interactively browsing the domain arrangements of proteins in all 17 genomes is available at http://www.toulouse.inra.fr/prodomCG.html.

Introduction

The multi-domain combinatorial nature of many proteins makes it a challenge to comprehend protein diversity and to interpret the large sets of new sequences stemming from systematic genome sequencing. Such an interpretation is crucial for a useful annotation of genomic sequence data, and must rely on (1) a consistent classification of domain families, (2) efficient tools to detect both known and novel domains in every sequenced protein, and (3) a powerful human interface to access and comprehend the results of this automated analysis. Efforts towards a consistent classification of protein domains are underway with the construction of databases such as ProDom (Sonnhammer and Kahn, 1994; Corpet et al., 1999), PFAM (Sonnhammer et al., 1997) or DOMO (Gracy and Argos, 1998). Useful graphical interfaces are provided, for instance on the World Wide Web or with the XDOM program (Gouzy et al., 1997). Automatic domain detection algorithms have been developed, such as DOMAINER (Sonnhammer and Kahn, 1994), MKDOM (Gouzy et al., 1997) or DIVCLUS (Park and Teichmann, 1998). Here we use a new, more efficient domain clustering algorithm to systematically analyse domain families in the available fully sequenced genomes.

MKDOM version 2

Recently Altschul et al. (1997) published the PSI-BLAST program, which elegantly recruits sets of homologous proteins using an iterative position-specific score matrix. PSI-BLAST cannot be used directly to build domain families in a fully automatic way, because multi-domain sequences yield heterogeneous sequence sets. However, if sequence databases are queried with a single domain, then PSI-BLAST should allow for the rapid generation of the corresponding homogeneous domain family. We

Whole genome domain analysis

In order to systematically analyse domain families and domain arrangements in various bacterial, archaeal and eukaryotic organisms, we applied the algorithm outlined above to a set of 38,440 sequences encompassing all known or predicted protein sequences encoded by 17 whole genomes. These include 12 bacteria (Aquifex aeolicus, Bacillus subtilis, Borrelia burgdorferi, Chlamydia trachomatis, Escherichia coli, Haemophilus influenzae, Helicobacter pylori, Mycobacterium tuberculosis, Mycoplasma

Statistics of multi-domain proteins in various genomes

The analysis above allowed us to calculate the distribution of multi-domain proteins for each of the 17 available complete genomes (Fig. 2). In general, the number of proteins containing k or more domains decreases exponentially with k. There are, however, a few exceptions, the most notable being B. subtilis and M. tuberculosis, which show biphasic distributions with abnormally large sets of highly multi-domain proteins. Extreme multi-domain proteins in B. subtilis include polyketide synthase
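The cumulative distribution described in this section (the number of proteins containing k or more domains) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `at_least_k_domains` and the toy input are hypothetical, and the per-protein domain counts are assumed to come from a prior domain assignment step that is not shown here.

```python
from collections import Counter


def at_least_k_domains(domain_counts):
    """Return {k: number of proteins with k or more domains}, for k = 1..max.

    domain_counts: one integer per protein, giving its number of domains
    (hypothetical input; assumed to come from a domain assignment step).
    """
    if not domain_counts:
        return {}
    exact = Counter(domain_counts)  # proteins with exactly k domains
    cumulative = {}
    running = 0
    # Accumulate from the largest k downwards to get "k or more" counts.
    for k in range(max(exact), 0, -1):
        running += exact[k]
        cumulative[k] = running
    return dict(sorted(cumulative.items()))


# Toy genome: four 1-domain, two 2-domain, one 3-domain, one 5-domain protein.
toy_counts = [1, 1, 1, 1, 2, 2, 3, 5]
distribution = at_least_k_domains(toy_counts)
```

If the counts decay exponentially with k, their logarithms fall on a straight line when plotted against k, so a biphasic distribution would appear as a change of slope on such a semi-logarithmic plot.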

Domain shuffling

The analysis above allowed us to identify which domains are highly shuffled on the N-terminal side, on the C-terminal side or on both sides. We find among the most shuffled domains:

  • the TPR domain which neighbours 144 and 124 different domain types on the N-terminal and C-terminal sides, respectively; it is frequently tandemly repeated;

  • the ‘two-component’ receiver domain;

  • ferredoxin domains.
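The notion of shuffling used in the list above, counting how many different domain types neighbour a given domain on each side, can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `neighbour_diversity`, the input layout (each protein as a list of domain family labels ordered from N- to C-terminus) and the toy labels are assumptions, and a tandem repeat is assumed to count the domain itself as one neighbour type.

```python
from collections import defaultdict


def neighbour_diversity(proteins):
    """Count distinct neighbouring domain types on each side of every domain.

    proteins: dict mapping a protein id to its list of domain family labels,
    ordered from N-terminus to C-terminus (hypothetical input layout).
    Returns {family: (n_distinct_N_side, n_distinct_C_side)}.
    """
    n_side = defaultdict(set)  # families seen immediately N-terminal
    c_side = defaultdict(set)  # families seen immediately C-terminal
    for domains in proteins.values():
        for left, right in zip(domains, domains[1:]):
            c_side[left].add(right)   # 'right' follows 'left'
            n_side[right].add(left)   # 'left' precedes 'right'
    families = set(n_side) | set(c_side)
    return {fam: (len(n_side[fam]), len(c_side[fam])) for fam in families}


# Toy example with made-up domain arrangements.
toy_proteins = {
    "p1": ["TPR", "TPR", "kinase"],
    "p2": ["receiver", "TPR", "ferredoxin"],
    "p3": ["kinase", "receiver"],
}
diversity = neighbour_diversity(toy_proteins)
```

Comparing the two numbers for one family gives a rough indication of a shuffling bias towards the N-terminal or the C-terminal side.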

Some domains show a clear shuffling bias towards the N-terminal or the C-terminal side. For instance, the

Acknowledgements

We wish to thank Claude Chevalet for stimulating discussions. This work was supported in part by the Centre National de la Recherche Scientifique (Genome Initiative) and the European Union (Biotech BIO4-CT980052).
