Main

Nematodes, or roundworms, are a highly diverse group of organisms1. What nematodes lack in obvious morphological disparity, they make up for in abundance, accounting for 80% of all individual animals on earth2, and diversity, with estimates ranging from 100,000 to 1 million extant species3. They exploit a wide variety of niches and include free-living terrestrial and marine microbivores, meiofaunal predators, herbivores, and plant and animal parasites. On the basis of small subunit ribosomal RNA (SSU rRNA) phylogenetics1,4, nematodes can be divided into three major clades: Dorylaimia (clade I)1,4, Enoplia (clade II) and Chromadorea (which includes Rhabditida, also known as Secernentea). Rhabditida can be further divided into Spirurina (clade III), Tylenchina (clade IV) and Rhabditina (clade V; Fig. 1). Parasitism of both animals and plants seems to have arisen multiple times during nematode evolution, and all major clades include parasites.

Figure 1: EST data sets from across the phylum Nematoda.

(a) Species are grouped into major taxonomic groups based on SSU rRNA phylogeny1,4. This differs from 'traditional' phylogenies but is consistent with current morphological and developmental evidence. The trophic biology of each targeted species is indicated by a small icon. (b) The proportion of each species' partial genome that has significant similarity (a match with a raw BLASTX score ≥50) to the complete proteome of C. elegans. Owing to the difference in criteria used to define significant similarity, these numbers differ slightly from those previously reported17,18.

Most nematode diseases are intractable problems. Nematode infections cause substantial human mortality and morbidity, especially in tropical regions of Africa, Asia and the Americas: 2.9 billion people are infected. The resulting morbidity rivals that of diabetes and lung cancer in worldwide disability-adjusted life year measurements5. Although mortality is low in proportion to the huge number of infections, deaths may still total 100,000 annually. The most important parasites include hookworms, Ascaris and whipworms (>1 billion infections each) and the filarial nematodes that cause elephantiasis and African river blindness (120 million infections). Parasitic nematodes also cause substantial losses in livestock and companion animals and are responsible for $80 billion in annual crop damage worldwide6.

Much of what we know about the molecular and developmental biology of nematodes stems from the study of the free-living soil rhabditine nematode Caenorhabditis elegans (Fig. 1). C. elegans is a versatile and tractable model organism, contributing substantially to understanding of important medical fields including cancer, ageing, neurobiology and parasitic diseases7,8,9. C. elegans was the first multicellular organism whose genome was completely assembled7. Despite the wealth of information available for C. elegans, and its sister species Caenorhabditis briggsae10, comparatively little is known about other members of this important phylum.

Two projects were initiated to generate new sequence data for nematode parasites spanning the phylogenetic disparity of the phylum11. We used expressed-sequence tags (ESTs; sequences derived from randomly selected cDNA clones) as they are a cost-effective route to gene discovery12. We generated 265,494 sequences from 30 different species of nematode, the largest collection of ESTs representing the full diversity of a single phylum. In addition to identifying traits that may be species- or phylum-specific, this collection offers an unparalleled opportunity to explore and elucidate evolutionary and functional relationships. Here we present an overview of the sequence data arising from the parasitic nematode EST project and place them in the context of C. elegans genomic biology.

Results

265,494 ESTs from nematodes other than Caenorhabditis

Nematode EST projects have generated more than 250,000 ESTs from 30 target species (Fig. 1 and Table 1 online; refs. 11,13–20 and J.P.M. and M.L.B., unpublished data). For each species, we grouped ESTs into clusters and predicted consensus sequences for each cluster (putative gene). These sequences together form the 'partial genome' of each species. Figure 2a shows the level of redundancy (ESTs per gene) associated with each partial genome. We observed diminishing returns, in terms of new gene discovery, as we sequenced more ESTs from one species. Redundancy was greatest for Ascaris suum, the most heavily sampled species. The number of genes per species ranged from 208 for the smallest EST set (Zeldia punctata, 388 ESTs) to more than 9,500 (Brugia malayi, 25,067 ESTs). We defined 93,645 putative genes. This is probably a slight overestimate, as the clustering process may split some allelic variation into distinct genes (most parasitic nematode populations used were outbred), different splice forms may not have been clustered together, and nonoverlapping ESTs derived from the same mRNA will not have clustered. This inflation is probably minor (5%), based on comparisons to the complete C. elegans proteome17,18 and previous analyses of subsets of these data18,19. If, as seems likely, most nematodes have 20,000 protein-coding genes7,10,21, we have tags for 1–50% of the expected gene complement for each species, with a mean of 16%. The total number of putative genes triples the number of nematode genes defined22,23.
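The redundancy and coverage figures above follow from simple ratios. The sketch below recomputes them for the two extreme data sets quoted in the text. It is illustrative only: the function names are hypothetical, the 20,000-gene complement is the assumption stated in the text, and the B. malayi gene count is taken as the lower bound of "more than 9,500".

```python
# Illustrative recomputation of redundancy (ESTs per gene) and coverage.
# EXPECTED_GENES is the per-genome gene complement assumed in the text.
EXPECTED_GENES = 20_000

# Only the two extreme data sets quoted in the text are included here.
# The B. malayi gene count is a lower bound ("more than 9,500").
datasets = {
    "Zeldia punctata": {"ests": 388, "genes": 208},
    "Brugia malayi": {"ests": 25_067, "genes": 9_500},
}


def redundancy(ests: int, genes: int) -> float:
    """Mean number of ESTs per cluster (putative gene)."""
    return ests / genes


def coverage(genes: int, expected: int = EXPECTED_GENES) -> float:
    """Fraction of the expected gene complement tagged by at least one EST."""
    return genes / expected


for species, d in datasets.items():
    print(
        f"{species}: {redundancy(d['ests'], d['genes']):.2f} ESTs per gene, "
        f"{coverage(d['genes']):.1%} of the expected gene complement"
    )
```

With these inputs the coverage values fall at roughly 1% and just under 50%, in line with the 1–50% range reported.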

Table 1: Summary information of sequence data derived from 30 different species of nematodes

Figure 2: Gene discovery in nematode EST data sets.

(a) Gene discovery rates in nematode EST data sets. This graph shows the relationship between the number of ESTs sequenced and the number of genes discovered. Each point represents an individual organism's data set. See Table 1 for details. (b) Exploration of genespace in the phylum Nematoda. The cumulative number of different genes (those that have no significant similarity to any other gene) in the EST and proteome data sets from the phylum Nematoda. Each point represents the addition of one nematode species. The first point represents the 22,000 C. elegans proteins. As each partial genome data set was added, increasing the total number of genes (x axis), the number of different genes (y axis) increased. There was no apparent fall-off in the rate of discovery of new genes, suggesting that nematode genespace may be very large. The colors indicate the systematic origin of each species group (see Fig. 1).

Genomic disparity across the phylum Nematoda

The different genes found in the genome(s) of an organism or group of organisms can be thought of as occupying a 'genespace'. More complex genomes occupy a larger genespace, in general, and larger groups of organisms (e.g., phyla) have a genespace that is the union of the constituent species' genespaces. Analysis of bacterial genespace from complete genomes showed that sequencing additional eubacterial genomes has yielded diminishing returns in terms of novelty24. If nematode genespace is similarly limited, the fact that the genomes of C. elegans and C. briggsae have been completely sequenced means that sampling additional genomes will yield a low rate of gene discovery. We carried out an exhaustive series of cross-species BLAST analyses to estimate the extent of nematode genespace. We found that 30–70% of the genes from each species had no non-nematode homolog (Table 1). The partial nature of the consensus sequences derived from the ESTs may preclude finding matches: short sequences will not reach our score cutoff, and some consensuses may cover only 3′ untranslated regions. Of the 64,685 cluster consensuses longer than 400 bp, 29,118 (45%) had no significant match to non-nematode sequences; for consensuses shorter than 400 bp, 78% seemed to be new. Thus, even excluding short sequences, nearly half the predicted genes seem to be new. The rate of discovery of genetic novelty has not yet started to decline with the analysis of new genomes (Fig. 2b), implying that nematode genespace may be much larger than bacterial genespace.
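The accumulation curve of Fig. 2b can be thought of as a running count of distinct genes as species are added one at a time. The following is a minimal sketch of that bookkeeping, not the analysis pipeline used here: it assumes each gene has already been assigned a similarity-group identifier (genes with significant similarity share an identifier), and the species names and identifiers are invented toy data.

```python
def cumulative_novelty(species_genes):
    """Track genespace growth as species are added.

    species_genes: list of (species_name, list_of_group_ids). Genes that share
    a group id are treated as significantly similar to one another.
    Returns a list of (species_name, total_genes_so_far, distinct_groups_so_far).
    """
    seen = set()
    total = 0
    curve = []
    for name, groups in species_genes:
        total += len(groups)   # x axis: total number of genes added so far
        seen.update(groups)    # y axis: number of distinct groups seen so far
        curve.append((name, total, len(seen)))
    return curve


# Toy data (hypothetical): three species with partly overlapping gene groups.
toy = [
    ("species_A", ["g1", "g2", "g3", "g4"]),
    ("species_B", ["g1", "g2", "g5"]),
    ("species_C", ["g2", "g6", "g7"]),
]

for name, total, distinct in cumulative_novelty(toy):
    print(name, total, distinct)
```

A curve that flattens (distinct count stops rising as total rises) would indicate a saturated genespace; a curve that keeps rising, as reported for the nematode data, indicates continuing novelty.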

Of the 93,645 putative genes identified in this study, 14,630 (15%) had significant sequence similarity to putative genes in all of the five major nematode taxonomic groups (i.e., homologs of these genes were identified in all major clades; Table 1). Most of these (13,368; 91%) also had homologs outside Nematoda and therefore are probably involved in core metabolic or structural pathways. We found that 1,262 genes are nematode-specific but widely represented within the phylum. These genes may have roles unique to the nematode body plan and life history and are good targets for pan-nematode control drugs.
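The partition described above reduces to two tests per gene: does it have a match in every major group, and does it have a match outside Nematoda? The sketch below expresses that logic and rechecks the reported counts. The function name and category labels are hypothetical, and the group labels in the usage lines are placeholders rather than the actual taxon names.

```python
N_GROUPS = 5  # number of major taxonomic groups compared in the text


def classify(groups_with_hit, has_non_nematode_homolog, n_groups=N_GROUPS):
    """Assign a gene to one of three illustrative categories.

    groups_with_hit: iterable of group labels in which the gene has a
    significant match. Labels are arbitrary; only the count of distinct
    labels matters.
    """
    if len(set(groups_with_hit)) < n_groups:
        return "restricted"
    if has_non_nematode_homolog:
        return "core"
    return "pan-nematode-specific"


# Placeholder group labels for illustration.
all_groups = ["group_1", "group_2", "group_3", "group_4", "group_5"]
print(classify(all_groups, True))
print(classify(all_groups, False))
print(classify(all_groups[:3], False))

# Consistency check on the counts reported in the text.
pan_nematode = 14_630
with_outside_homolog = 13_368
print(pan_nematode - with_outside_homolog)                 # 1262
print(round(100 * with_outside_homolog / pan_nematode))    # 91
```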

These findings raise an important issue: if organisms from the same major taxonomic group share only 60% of their genes, then individual taxa may have widely divergent biology. Within the genus Caenorhabditis, C. elegans and C. briggsae share 90% of their genes at the level of discrimination used here10. Our survey suggests that this level of genetic novelty may be universal across the Nematoda. Lineage-specific genes could have completely new functions, could have similar functions to genes in other organisms but use a completely different mechanism (analogous genes), or could have similar features, such as tertiary structure, that enable them to use similar molecular mechanisms.

Genomic conservation across the phylum Nematoda

Since the last common ancestor of Nematoda, 750–650 million years ago25,26, nematodes have evolved to occupy many niches. The success of nematodes today reflects the expression of successful complexes of genetic traits, some of which are derived from the common ancestor. A contrasting tendency towards evolution of new genetic function underpinning particular fitness advantage in particular habitats will have resulted in divergence in gene complement. We examined the patterns of gene gain, retention and loss across the phylum by comparing partial genomes both within and between major clades.

The complete genome of C. elegans yields a predicted proteome of more than 22,000 polypeptides, some of which derive from alternative splicing and more than 75% of which have some experimental verification27,28. We carried out extensive comparisons of the 93,645 new nematode genes with this data set; comparisons with the genome of C. briggsae10 yielded similar results (data not shown). For each C. elegans chromosome, we examined the patterns of gene density and the density of genes with known RNA interference (RNAi) phenotypes (Fig. 3). As described previously, each autosome has a greater density of genes in the autosomal centers7,29, and RNAi phenotypes cluster to the centers28,30. The autosomal arms are enriched in C. elegans–specific genes, in tandemly duplicated gene families and in repetitive and transposable elements7. The sex chromosome (X) has a more even distribution of genes and RNAi phenotypes along its length.

Figure 3: Chromosomal location of C. elegans homologs of other nematode genes.

Each panel represents a different C. elegans chromosome (autosomes I–V and the X sex chromosome). Track 1: the average gene density per 100-kb division (brightest red, >40 genes per 100 kb; blue, <10 genes per 100 kb). Track 2: the relative abundance of genes with visible RNAi phenotypes (red, highest abundance; blue, no genes represented). Tracks 3–7: the abundance of C. elegans genes with homologs in the pooled partial genome data sets of the five major clades of the Nematoda. Tracks 8–10: the abundance of C. elegans genes with homologs in the complete partial genome, in tissue (intestine) and in stage-specific (embryo) subsets from A. suum. Track 11: the positions of individual C. elegans genes that have homologs in all five major clades. Track 12: the positions of individual C. elegans genes that have homologs in all five major nematode clades but are not significantly similar to any non-nematode gene (a subset of the genes plotted in track 11). The number of genes contributing to each track plot is given in brackets.
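The density tracks are counts per 100-kb division along each chromosome. A minimal sketch of such binning is given below. It assumes 0-based gene start coordinates and assigns each gene to the bin containing its start; how genes spanning a bin boundary were handled in the figure is not stated, so that choice is an assumption, and the coordinates are toy values.

```python
from collections import Counter

BIN_SIZE = 100_000  # 100-kb divisions, as in the figure


def gene_density(gene_starts, chrom_length, bin_size=BIN_SIZE):
    """Count genes per fixed-width bin along one chromosome.

    gene_starts: iterable of 0-based start coordinates (assumed convention).
    Returns a list with one count per bin, covering the whole chromosome.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    counts = Counter(pos // bin_size for pos in gene_starts)
    return [counts.get(i, 0) for i in range(n_bins)]


# Toy coordinates (hypothetical) on a 300-kb chromosome.
starts = [5_000, 50_000, 99_999, 100_000, 250_000]
print(gene_density(starts, 300_000))  # [3, 1, 1]
```

The same function applied to the subset of genes with RNAi phenotypes, or with homologs in a given clade, would give the other tracks on a comparable scale.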

We compared the nematode partial genomes to the C. elegans proteome in this genomic context (Fig. 3). Although we cannot make definitive statements concerning the absence of homologs, matches, when summed over major clades, should identify overall themes. Chromosomal centers were enriched, relative to arms, in the density of similarity matches to all the major taxonomic groups. The pattern of match density faithfully reflected the distribution of genes with RNAi phenotypes28,30 rather than overall gene density. Thus, taking chromosome (chr) II as an example, each of the five major clades had a high density of matches from 4 Mb to 12 Mb (the chromosomal center), with additional peaks of RNAi phenotype genes at 0–0.5 Mb, 3.5 Mb, 13 Mb and 14–15 Mb mirrored by matches in the nematode partial genomes (Fig. 3).

There were additional regions of high density of matches that did not coincide with RNAi phenotype genes. On chr II, we observed high match densities in all major clades at 1.0–1.2 Mb and 13.6 Mb, where RNAi phenotype genes were rare. We noted similar regions on chr I (1.2–2.0 Mb), chr III (0.8–1.0 Mb), chr IV (1.8 Mb) and chr V (1 Mb and 20 Mb; Fig. 3). On the X chromosome there was a cluster of conserved genes at 17–18.5 Mb that did not have a high frequency of RNAi phenotypes. These conserved genes may not yield RNAi phenotypes in C. elegans because of gene family redundancy, or because they are involved in nematode-specific phenotypes that are not assessed by currently applied assays under select laboratory conditions. Conversely, there were a few chromosomal regions where a high density of RNAi phenotype genes was not matched by a high density of matches to the partial genomes: one example is on chr II at 2.8 Mb (Fig. 3).

Comparison between autosomes showed that chr V had 60% fewer matches per C. elegans gene than the other autosomes. This adds to the known peculiarities of chr V (ref. 31), which also has more C. elegans–specific genes, has fewer RNAi phenotypes per gene, and is significantly less likely to match genome survey sequences from the filarial nematode B. malayi21.

Mapping genomic divergence within the phylum Nematoda

C. elegans is a rhabditine nematode1, closely related to the Diplogasteromorpha (represented here by Pristionchus pacificus) and the vertebrate-parasitic Strongyloidea (represented here by seven species; Fig. 1 and Table 1)4. The proportion of the partial genome of each sampled species that contained C. elegans homologs roughly followed the species' relationship to C. elegans, with rhabditine species having the highest mean proportion of homologs (Figs. 1 and 4). Comparative analyses of the full partial genome data sets yield a set of similarity relationships congruent with the SSU rRNA phylogeny4. We used SimiTri analyses, which plot the relative similarity of entire gene data sets against three data sets of interest32, to examine these relationships further. Comparison of the partial genome of the hookworm Ancylostoma caninum with those of Haemonchus contortus (another strongyloid parasite), P. pacificus and C. elegans (Fig. 4a) showed that although A. caninum genes with matches to C. elegans comprised the largest subset, most genes were more similar to H. contortus than to P. pacificus. A number of genes had high-scoring matches in P. pacificus and H. contortus but were lacking from the complete C. elegans proteome, probably indicative of gene loss in the C. elegans lineage. Extending these analyses to span the major nematode clades showed the expected global trend for A. caninum genes to be most similar to homologs from other rhabditids (including other species from Strongyloidea; Fig. 4b) and least similar to homologs from the dorylaims Trichinella spiralis and Trichuris muris (Fig. 4c)19.

Figure 4: Comparing partial genomes across the Nematoda.

SimiTri plots provide a two-dimensional representation of the degree of similarity of an entire data set of sequences with those of three different organisms32. The plots allow estimation of relationships of whole sequence data sets and highlight genes with patterns of conservation that differ from the main trend in a data set (see Supplementary Methods for a more detailed description). (a) A. caninum compared with H. contortus, P. pacificus and C. elegans. This plot shows that although A. caninum has more matches to C. elegans (because its complete genome is available), overall, the sampled transcriptome from A. caninum is closer to that of H. contortus. (b) A. caninum compared with other rhabditine nematodes (excluding C. elegans), nematodes from other clades and C. elegans. (c) All partial genomes from rhabditid nematodes (excluding C. elegans species) compared to dorylaim, spirurine and tylenchine nematode partial genomes. The numbers at each vertex indicate the number of genes matching only that database. The numbers on the edges indicate the number of genes matching the two linked databases. The number in each triangle indicates the number of genes with matches to all three databases.
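One simple way to place a gene inside a triangle according to its similarity to three databases is to use normalized scores as barycentric weights. The sketch below illustrates that idea only; whether SimiTri computes positions in exactly this way is an assumption (see ref. 32 and Supplementary Methods for the actual method), and the function name and triangle layout are hypothetical.

```python
import math

# Vertices of an equilateral triangle with unit side, one per database.
VERTICES = ((0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2))


def triangle_position(score_a, score_b, score_c):
    """Map three non-negative similarity scores to a point in the triangle.

    Each score is normalized by the sum of the three and used as the weight
    of the corresponding vertex (barycentric coordinates).
    """
    total = score_a + score_b + score_c
    if total <= 0:
        raise ValueError("at least one score must be positive")
    weights = (score_a / total, score_b / total, score_c / total)
    x = sum(w * vx for w, (vx, _) in zip(weights, VERTICES))
    y = sum(w * vy for w, (_, vy) in zip(weights, VERTICES))
    return x, y


print(triangle_position(100, 0, 0))     # match to database A only: vertex A
print(triangle_position(100, 100, 0))   # equal matches to A and B: midpoint of edge A-B
print(triangle_position(80, 80, 80))    # equal matches to all three: centroid
```

Under this scheme, genes matching a single database sit at a vertex, genes matching two sit on an edge, and genes matching all three fall in the interior, which corresponds to the vertex, edge and triangle counts described in the legend.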

Tylenchomorpha, Cephalobomorpha and Panagrolaimomorpha species also had a high proportion of matches to rhabditine sequences, but as the predicted phylogenetic distance from C. elegans increased, there was a corresponding decrease in the number of genes sharing significant similarity (Fig. 1). The cephalobomorph Z. punctata seems to be an exception, but this anomaly is probably due to sampling from a single spliced leader-PCR–based library that biases towards short, conserved transcripts16. Only 45% of the genes from the dorylaims T. muris and T. spiralis had significant similarity to genes from C. elegans, but a similar percentage of their genes shared similarity with Drosophila melanogaster (data not shown). Thus T. spiralis and T. muris may be good choices for deeper genomic analysis of the relationships of Nematoda to other metazoan phyla. Overall, these results suggest that C. elegans will be an effective genomic model for other rhabditid nematodes, but that accumulated differences will make extrapolation to distantly related nematodes, such as dorylaims, more challenging. There are many nematode genes, found in species across the phylum, that are absent from C. elegans. Gene loss has therefore been an important part of C. elegans genome evolution, as was suggested by the finding that C. elegans' depleted HOX gene complement is a result of lineage-specific losses33.

It has previously been suggested that there is a high-level ordering of genes within C. elegans chromosomes, with, for example, muscle-expressed genes being located in close proximity to each other more often than would be expected by chance34, and RNAi phenotyping suggesting that large chromosomal domains of genes have similar biological roles28. We investigated whether nematode-specific or stage-specific genes were clustered at a megabase level but did not find evidence for linkage of these classes of genes. For example, we mapped homologs of A. suum genes that had stage- or tissue-specific expression patterns to the C. elegans genome (Fig. 3). The tissue-specific (intestine) or stage-specific (embryo, L3, L4) genes showed the same general pattern of distribution as all A. suum genes.

Nematode-specific genes and gene families

Targets for drugs with diminished risk of toxicity to hosts or other nontarget organisms may be found among the nematode-specific genes. We found that 30–50% of each of our chosen species' partial genomes was unique. Of the 52,267 genes for which no homolog was identified outside the phylum (Table 1), 21,640 had significant similarity with a sequence from another nematode species. Mapping these nematode-specific genes onto the phylogeny showed an incremental evolution of novelty (Fig. 5). Most unique genes were associated with shallow-level taxonomic groups, but a considerable proportion had a deeper origin. Some deep splits in Nematoda were associated with few unique genes (e.g., Panagrolaimomorpha plus Tylenchomorpha/Cephalobomorpha has only 198), perhaps reflecting relatively rapid divergence of these daughter clades shortly after the origin of the ancestral tylenchine.

Figure 5: Evolutionary origins of unique genes and gene families in the phylum Nematoda.

The inferred positions of origin for the nematode-specific genes and gene families were mapped across the robust SSU rRNA phylogeny1,4. For each node, the upper number shows the number of genes unique to each clade, and the lower number shows the predicted number of unique gene families (with the number of individual predicted genes included in these families in brackets). In the absence of complete genome sequences from most of the Nematoda, this mapping places the origin of each gene or gene family at the highest possible node: adding complete genome sequences will tend to move the node of origin of some genes lower in the tree.
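Placing a gene at its latest possible node of origin amounts to finding the most recent common ancestor of all taxa in which a homolog was detected. The sketch below shows that lookup on a deliberately simplified tree. The topology and the internal node names (node_TR, node_STR, root) are hypothetical stand-ins for the SSU rRNA phylogeny, not a reproduction of it.

```python
# Simplified, hypothetical tree: child -> parent. Internal node names are invented.
PARENT = {
    "Tylenchina": "node_TR",
    "Rhabditina": "node_TR",
    "node_TR": "node_STR",
    "Spirurina": "node_STR",
    "node_STR": "root",
    "Dorylaimia": "root",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def latest_possible_origin(clades_with_homologs):
    """Most recent common ancestor of all clades in which a homolog was found."""
    clades = list(clades_with_homologs)
    if not clades:
        raise ValueError("need at least one clade")
    common = set(path_to_root(clades[0]))
    for clade in clades[1:]:
        common &= set(path_to_root(clade))
    # Walking up from any tip, the first shared node is the most recent one.
    for node in path_to_root(clades[0]):
        if node in common:
            return node


print(latest_possible_origin({"Tylenchina", "Rhabditina"}))   # node_TR
print(latest_possible_origin({"Spirurina", "Rhabditina"}))    # node_STR
print(latest_possible_origin({"Dorylaimia", "Tylenchina"}))   # root
```

Because absence from an EST data set does not prove absence from a genome, a homolog found later in a more distant taxon can only move the inferred origin towards the root, as noted in the legend.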

We clustered the predicted polypeptides associated with these nematode-specific genes using Tribe-MCL35 into putative gene families and mapped the latest possible origin of these families onto the phylogeny (Fig. 5). In general, the number of new gene families at each node of the tree reflected the number of genes associated with the smallest daughter clade. The two largest groups of unique gene families occur at the base of Rhabditida and at the node connecting Spirurina with Tylenchina plus Rhabditina. This possibly reflects both the relatively large number of taxa and the number of ESTs generated for these three clades (e.g., the Tylenchina data set included several closely related Meloidogyne species17). Most unique gene–origin events seemed to occur relatively early in the nematode lineage. For example, more than 6,500 genes had homologs in each of the three taxonomic groups in the Rhabditida, and these included 4,330 genes in 1,262 nematode-specific families (Fig. 5).

We examined some nematode-specific gene families in more detail. For many we identified C. elegans members, permitting exploration of possible function through published RNAi data28. For example, a family with ten members from all major clades in our data set had a single C. elegans homolog identified by RNAi to be essential for postembryonic larval development (Supplementary Fig. 1 online). Another was limited to Rhabditida and had a C. elegans member with an RNAi phenotype of inhibition of postembryonic growth (Supplementary Fig. 1 online). In both cases, the degree of sequence conservation suggests that the RNAi function may be ascribed to the other genes: the C. elegans phenotype recommends these for further investigation as targets for nematicides.

Domain and functional analysis of nematode proteins

We used InterPro36 to identify known protein domains in the partial genomes. Because the C. elegans proteome has been extensively investigated for protein domains7,37, many domains of unknown function have been defined that are exclusive to C. elegans and C. briggsae. Many of the matches we discovered were to these nematode-restricted domains. For each species, 30–50% of the polypeptides were predicted to contain at least one previously identified domain (Supplementary Table 1 online). Fewer polypeptides from both spirurine and dorylaim species than from tylenchine and strongyloid species contained a domain, reflecting the C. elegans bias in InterPro. The number of unique domains associated with each species increased with size of its partial genome (Supplementary Table 1 online).

Comparison of the most abundant domains associated with each clade showed that, with the exception of the protein kinase domain, the abundant domains did not correlate well with those previously identified in the complete proteomes of C. elegans and C. briggsae (Supplementary Table 2 online)7,10. These differences may have arisen from the unavoidable bias in the types of genes sampled by ESTs. We minimized this bias by grouping the partial genomes and their domain contents by major clade (Supplementary Table 2 online). Cuticle collagens (IPR008160 and IPR002048) are abundant in C. elegans and C. briggsae (170 in each) but were poorly represented in the dorylaim partial genomes, possibly reflecting the derivation of these data sets from nonmoulting stages. Collagens have a temporally restricted expression pattern in C. elegans, with most expression in larval stages38. The strongyloid partial genomes were enriched for peptidases (IPR000169, IPR001254, IPR001353 and IPR00668), and dorylaim sequences were enriched for potential proteinase inhibitors (IPR008197 and IPR008198) and for chymotrypsin (IPR001254). This may reflect the parasitic niche of the sampled species (the host intestine), where peptidases and inhibitors may be required for feeding and survival in such a hostile environment. Also in Strongyloidea, the ShK metridin-like toxin domain (IPR003582) was highly represented, perhaps reflecting involvement in parasitic interactions15. EGF-like domains (IPR006209) are one of the more common domains in C. elegans and C. briggsae7,10. Although EGF-like domains were found in other Rhabditina, dorylaim and panagrolaimomorph organisms had the highest relative abundance. EGF-like domains are associated with membrane-bound or secreted proteins involved in signaling and recognition and may therefore be involved in manipulation of host responses.

Analysis of InterPro domain representation in nematode-specific genes identified a set of domains that may be important in parasitism. In the Strongyloidea, the thirteenth-most abundant InterPro domain is the allergen V5/Tpx-1 related domain IPR001283, found in many secreted proteins, most notably in the Ancylostoma secreted proteins (asp) that have immunomodulatory activity39. Additional asp-like proteins are present in other clades15. Although IPR001283 is found in organisms other than nematodes, genes containing this domain have undergone lineage-specific amplification and divergence in Strongyloidea. An abundant domain found in nematode-specific gene families of particular prominence is the 'transthyretin-like' IPR001534: this family had 394 members from all species (only lacking in Nippostrongylus brasiliensis), of which 377 had no significant BLAST similarity outside the Nematoda. It is prominent in C. elegans and C. briggsae. Mammalian transthyretins transport thyroid hormones, and many of the nematode genes have secretory leader polypeptides, suggestive of a role in hormonal signaling in nematodes also.

To compare the biological functions of the genes associated with each nematode, we used InterPro matches to assign Gene Ontology terms40. The high-level Gene Ontology profiles for each clade were very similar (Fig. 6), but we noted some differences between clades. Dorylaim and panagrolaimomorph nematodes had a lower incidence of structural proteins (Fig. 6a). Rhabditine nematodes had an elevated number of predicted extracellular proteins, and spirurine, tylenchine and rhabditine nematodes had an increased proportion of ribonuclear proteins (Fig. 6c)20. The two groups with the highest proportion of nuclear-localized predicted polypeptides were Tylenchina and Dorylaimia. In both, parasite-secreted proteins are known to localize to the host nucleus (Fig. 6c).

Figure 6: Functional annotation of genes using Gene Ontology terms.

Each sequence was compared with the InterPro database of domains and these matches were used to assign high-level Gene Ontology terms. The data are summarized by major clade (see Fig. 1). The x axes show the percentage of all the Gene Ontology terms for each assignment: (a) 'molecular function' assignments; (b) 'biological process' assignments; and (c) 'cellular component' assignments.

Metabolic pathway analyses

There is a general perception that parasites have lost function (undergone reductive evolution) as they came to rely on the metabolic capacity and homeostatic buffering of their hosts. But many parasitic nematodes spend part of their life cycle outside any host, or have multiple phylogenetically and metabolically different hosts, and therefore may experience evolutionary pressure to maintain or even expand metabolic and regulatory functions41. We compared the partial genome of each species with the KEGG database42 to determine the extent of metabolic pathway representation (Supplementary Table 3 online). For most pathways, the number of enzymes associated with each major clade correlated with the number of sequences generated (Table 1 and Supplementary Table 3 online). The general congruence between the major clades suggested that many pathways are conserved within the nematodes despite their diversity. Some differences were noted, however. Spiruria and Dorylaimia had 17 enzymes (34 clusters) from fatty acid biosynthesis pathway 1 (using acyl carrier protein-bound precursors) but lacked pathway-2 enzymes (using coenzyme A-bound precursors) completely, whereas Tylenchina had only pathway-2 enzymes (44 clusters mapping to three enzyme types), and Rhabditina had both. No valine or methionine biosynthesis enzymes were identified in the animal-parasitic Spiruria, suggesting that these may be essential amino acids. N-glycan degradation enzymes were notably abundant in Tylenchina, but less evident elsewhere. As the complete genome of C. elegans encodes many N-glycan degradation enzymes, this suggests that this pathway is particularly highly expressed in these plant parasites. Enzymes involved in inositol metabolism were also prominent in Tylenchina (and in the complete C. elegans proteome) but absent in other sampled species. These predictions of taxonomically restricted biochemical pathways may serve to direct drug target definition.

Discussion

We used 250,000 ESTs to predict more than 90,000 genes for a suite of important human, animal and plant parasitic, and two free-living, nematodes. Comparison of each species' partial genome with the complete genomes of C. briggsae and C. elegans and with genome data from other phyla identified a spectrum of genes and gene families: some were deeply conserved, others were found across the phylum but only in nematodes, and others were taxonomically restricted. This data set aids the annotation of the C. elegans genome by confirming gene predictions and by identifying, through alignment of homologs, conserved and thus functionally important residues. Highly conserved genes discovered in species across the phylum may have important functions in C. elegans, but some such genes currently have no known RNAi phenotypes, perhaps showing the limitation of on-plate assays. We identified tens of thousands of potential targets for drug and vaccine development. Many of these are nematode-specific but conserved across the phylum, offering the prospect of new pan-nematode treatments. We have deposited our sequence data in public databases as they were generated and offered our analyses openly over the Internet since the inception of the project, and so many of the genes we identified have already been selected by other researchers in parasitology and C. elegans biology for further study.

We look forward to further expansions of the nematode genome data sets. We are still sequencing additional ESTs from target species, and other projects, including enoplid and chromadorid taxa, are also underway or planned. The genus Caenorhabditis will soon have five nearly complete genome sequences, and the B. malayi genome is nearing completion at The Institute for Genomic Research43. Genome sequencing is planned for H. contortus, Meloidogyne hapla, P. pacificus and T. spiralis. Our survey indicates that model species cannot show the genetic and genomic diversity of even their own phylum, and that continuing, phylogenetically informed genome sequencing is essential for advances in genomics, evolution and infectious disease biology.

Methods

Large-scale EST generation across the phylum Nematoda.

We selected a portfolio of nematode species based on criteria of phylogenetic spread, availability of material and health, economic or scientific importance, after consultation with the research and funding communities. The number of ESTs sequenced for each species varied because of factors such as availability of material, quality and source of libraries, and perceived importance of the organism. For each species, we aimed to accumulate an EST data set spanning multiple life cycle stage–specific and, where possible, tissue-specific cDNA libraries. Previous experience with the filarial parasite B. malayi showed that sampling from throughout the nematode life cycle was essential to maximize rates of gene discovery44. To this end, we constructed 172 different cDNA libraries, using various cDNA synthesis technologies and vectors (Supplementary Methods online). Members of the research community were generous in providing both biological materials and libraries.

Sequencing and data processing.

Sequencing and EST processing at the Genome Sequencing Center were carried out as described17,18,45. Sequencing and processing at the Wellcome Trust Sanger Institute were carried out as described46. Before submitting the sequences to dbEST, we processed them to assess quality, trim vector, remove contaminants and cloning artifacts, and identify BLAST similarities using Genome Sequencing Center pipelines45 and the trace2dbEST pipeline47. All ESTs have been submitted to dbEST12.

Sequence clustering, annotation and database creation.

For each species, we downloaded sequences from dbEST in May 2003 and parsed them through PartiGene, a software pipeline designed to analyze and organize EST data sets47. We first checked sequences for contaminating vector sequence and trimmed poly(dA) tails. We then clustered sequences into groups (putative genes) on the basis of sequence similarity using CLOBB48. We assembled clusters to yield consensus sequences using PHRAP (P. Green, unpublished data). We then subjected each consensus sequence to a series of BLAST analyses49 against a suite of protein and nucleotide databases derived from public databases (see Supplementary Methods online for details) and against the thirty sets of consensus sequences (partial genomes), one for each nematode species analyzed here. We defined significant matches as those having a raw BLAST score ≥50 (this corresponds to an expect value of 10⁻⁵ to 10⁻⁶, depending on the size and composition of the databases). Although this cutoff may miss some homologous matches, it is sufficiently inclusive to identify most domain matches. Results were processed and stored in a local installation of a PostgreSQL database22,23.
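The score cutoff described above can be illustrated with a minimal Python sketch. It assumes that BLAST results have already been parsed into (query, subject, raw score) records; the `Hit` type, the function name and the example identifiers are hypothetical and are not part of PartiGene or of the pipeline used here.

```python
from typing import List, NamedTuple

MIN_RAW_SCORE = 50  # cutoff stated in the text (raw BLAST score >= 50)


class Hit(NamedTuple):
    """One parsed BLAST match (hypothetical record layout)."""
    query: str       # consensus sequence (cluster) identifier
    subject: str     # database sequence identifier
    raw_score: float


def significant_hits(hits: List[Hit], min_score: float = MIN_RAW_SCORE) -> List[Hit]:
    """Keep only matches whose raw score meets the cutoff (inclusive)."""
    return [h for h in hits if h.raw_score >= min_score]


# Made-up example records, for illustration only.
example = [
    Hit("cluster_A", "subject_1", 212.0),
    Hit("cluster_A", "subject_2", 49.0),   # just below the cutoff: dropped
    Hit("cluster_B", "subject_3", 50.0),   # exactly at the cutoff: kept
]
kept = significant_hits(example)
```

Because the comparison is inclusive, a match scoring exactly 50 is retained, in line with the "≥50" criterion.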

Predicted proteome analysis: Gene Ontology, domains and metabolic pathways.

For each consensus sequence, we obtained polypeptide predictions using the prot4EST package in the PartiGene pipeline47. Predicted polypeptide sequences were compared to InterPro (data version 7.0) to identify functional domains using InterProScan36. An InterPro annotation was assigned to each domain and translated into Gene Ontology40 codes. These results were parsed into a local installation of AmiGO40, from which broader functional categories were derived. Protein families were identified with TRIBE-MCL35 using default parameters.
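The translation of domain annotations into Gene Ontology codes amounts to a lookup followed by a per-polypeptide union of terms. The sketch below illustrates that idea only; the mapping table, all identifiers and the function name are hypothetical placeholders, and it does not reproduce InterProScan or AmiGO behavior.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical placeholder mapping from domain identifiers to GO codes.
# A real analysis would load a curated InterPro-to-GO mapping instead.
DOMAIN_TO_GO: Dict[str, Set[str]] = {
    "IPR_EXAMPLE_1": {"GO:EXAMPLE_A", "GO:EXAMPLE_B"},
    "IPR_EXAMPLE_2": {"GO:EXAMPLE_B"},
}


def go_terms_per_protein(
    domain_hits: Iterable[Tuple[str, str]],
    domain_to_go: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Collect the union of GO codes implied by each polypeptide's domains.

    domain_hits holds (polypeptide_id, domain_id) pairs. Domains without a
    mapping contribute no terms, so such polypeptides get an empty list.
    """
    terms: Dict[str, Set[str]] = defaultdict(set)
    for protein_id, domain_id in domain_hits:
        terms[protein_id].update(domain_to_go.get(domain_id, ()))
    return {protein: sorted(codes) for protein, codes in terms.items()}


annotations = go_terms_per_protein(
    [
        ("pep1", "IPR_EXAMPLE_1"),
        ("pep1", "IPR_EXAMPLE_2"),
        ("pep2", "IPR_UNMAPPED"),
    ],
    DOMAIN_TO_GO,
)
```

Storing terms as sets means a GO code implied by two domains of the same polypeptide is counted once.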

To map predicted polypeptides to metabolic pathways, we compared them using BLASTP to the KEGG database (version 29)42. We retained each match meeting a cut-off of an expect statistic ≤1 × 10⁻¹⁰. When one cluster matched several closely related enzymes, we considered the top match and all the matches within a range of 30% of the top score.
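The selection rule above can be sketched in Python as follows. This is a minimal illustration under two assumptions: that "within a range of 30% of the top score" means a score of at least 70% of the best score, and that the top score is taken among matches that already pass the expect cutoff. The `KeggHit` type, the function name and the example values are hypothetical.

```python
from typing import List, NamedTuple

MAX_EVALUE = 1e-10    # expect cutoff stated in the text
SCORE_WINDOW = 0.30   # assumed reading: keep scores >= 70% of the top score


class KeggHit(NamedTuple):
    """One parsed BLASTP match against KEGG (hypothetical record layout)."""
    enzyme: str
    score: float
    evalue: float


def select_enzymes(
    hits: List[KeggHit],
    max_evalue: float = MAX_EVALUE,
    window: float = SCORE_WINDOW,
) -> List[KeggHit]:
    """Return the top match plus all matches scoring close to it."""
    passing = [h for h in hits if h.evalue <= max_evalue]
    if not passing:
        return []
    top_score = max(h.score for h in passing)
    floor = top_score * (1.0 - window)
    near_top = [h for h in passing if h.score >= floor]
    return sorted(near_top, key=lambda h: h.score, reverse=True)


# Made-up matches for a single cluster, for illustration only.
cluster_hits = [
    KeggHit("enzyme_X", 400.0, 1e-100),
    KeggHit("enzyme_Y", 300.0, 1e-70),   # above 70% of 400: kept
    KeggHit("enzyme_Z", 200.0, 1e-40),   # below 70% of 400: dropped
    KeggHit("enzyme_W", 450.0, 1e-5),    # fails the expect cutoff: dropped
]
selected = select_enzymes(cluster_hits)
```

Applying the expect filter before finding the top score keeps a weak-significance match from setting the reference score; the opposite order is also a defensible reading of the text.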

URLs.

See http://www.nematode.net/ for more information on Genome Sequencing Center trace files and clone ordering, and see http://www.nematodes.org/ for more information on Wellcome Trust Sanger Institute trace files and clone ordering. PHRAP is available at http://www.phrap.org.

Note: Supplementary information is available on the Nature Genetics website.