Main

The International Human Genome Sequencing Consortium (IHGSC) recently completed a sequence of the human genome and published a report on the finishing of the human genome2,3. Now, papers containing detailed reports about each human chromosome are bringing to light aspects of the biomedical and evolutionary implications of this work. Here we describe the completion of a physical map, high-quality finished sequence, and gene catalogue for human chromosome 18, which represents approximately 2.7% of the human genome.

The extremely low density of protein-coding genes on chromosome 18 (Table 1) offers an opportunity to study the conservation of non-protein-coding sequences. It was recently observed that, in addition to protein-coding sequences, 3% of the human genome shows a degree of evolutionary conservation among mammals that is significantly higher than background4. It is unclear whether this sequence consists mostly of regulatory elements related to genes or whether it represents other elements not tightly coupled to genes. These alternatives can be explored by comparing gene-rich and gene-poor chromosomes to see whether the proportion of conserved non-protein-coding sequence tends to scale with gene density or is unrelated to gene density.

Table 1 Chromosome 18 gene content

The finished sequence of chromosome 18 contains 76,117,153 bases and is interrupted by three euchromatic gaps, one gap at the 18q telomere and one gap containing the centromeric heterochromatin (Fig. 1 and Supplementary Table S2). These gaps are refractory to current cloning and mapping technology. The sizes of the euchromatic gaps were estimated by alignment to the regions of conserved synteny in the mouse genome4 (see Methods). The size of the telomeric gap was estimated using the size of the telomeric half-YAC (yeast artificial chromosome). The total size of these gaps is estimated at 118 kb. This corresponds to <0.2% of the euchromatic length of the chromosome, substantially lower than the average across the human genome (cited in ref. 3, also refs 5–7). Of the finished sequence, 79% was generated by the Broad Institute of MIT and Harvard (formerly the Whitehead Institute/MIT Center for Genome Research or WICGR), 20% by the RIKEN Genomic Sciences Center, and the remaining 1% by three other research groups (Supplementary Tables S3, S4). Details of construction of the clone map and sequencing are described in the Supplementary Information.

Figure 1: Overview of human chromosome 18.

a, Blue shading indicates gene deserts (≥ 500 kb with no transcript, see Supplementary Table S8). Telomeres (pTEL and qTEL), the centromere (CEN) and euchromatic sequence gaps (red lines) are also indicated. b, G + C content in discrete windows of 100 kb. c, d, Densities of long interspersed nuclear elements (LINEs, red), short interspersed nuclear elements (SINEs, blue) (c) and transcripts (d) are shown as numbers of these elements in discrete windows of 100 kb. e, Blocks of conserved synteny (100-kb resolution) with dog, mouse and rat, determined for this work. Chromosomes are numbered, and are coloured arbitrarily for ease of distinction.

Several analyses verify that nearly the entire euchromatic region of chromosome 18 is present and accurately represented in the finished sequence. Of the 332 gene sequences in the well-curated RefSeq8 data set that have been mapped to chromosome 18, all are present and complete in the finished sequence. In addition, the finished sequence shows excellent alignment to genetic and radiation hybrid maps (Supplementary Fig. S1). The genetic map9 shows perfect alignment, with no discrepancies among 156 sequence-based genetic markers (Supplementary Table S5). The radiation hybrid map10 shows good agreement, but contains local discrepancies as would be expected from its lower resolution (Supplementary Table S6).

We assessed the local accuracy of the clone path by aligning paired-end sequences from a human Fosmid library (designated WIBR2, representing 10 × physical coverage) to the finished sequence3. By identifying discrepancies in the distances between Fosmid ends in the finished sequence and those expected on the basis of insert size constraints, one can detect errors in the clone path3. Our analysis revealed a single aberrant region, which was found to result from a bacterial artificial chromosome (BAC) clone containing a 21-kb deletion that was either present in the source genome or occurred in the cloning of the BAC; this clone was replaced with a non-deleted BAC from a different library. Finally, an independent quality assessment exercise commissioned by NHGRI estimated the accuracy of the finished sequence at less than one error per 100,000 bases11 (J. Schmutz, personal communication).
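The Fosmid-end check described above can be illustrated with a minimal sketch. It assumes that each end pair has already been placed on the finished sequence and that the library's insert sizes can be summarized by a mean and a standard deviation. The function name, the three-standard-deviation cutoff, the 40-kb insert size and all coordinates are illustrative assumptions, not values taken from this work.

```python
def flag_discordant_pairs(pairs, expected_mean, expected_sd, n_sd=3.0):
    """Return (name, span) for end pairs whose separation on the finished
    sequence falls outside the expected insert-size range.

    pairs: list of (name, left_pos, right_pos) placements (hypothetical format).
    """
    lo = expected_mean - n_sd * expected_sd
    hi = expected_mean + n_sd * expected_sd
    flagged = []
    for name, left, right in pairs:
        span = abs(right - left)
        if span < lo or span > hi:
            flagged.append((name, span))
    return flagged


# Toy placements (hypothetical). A deletion in the sequenced clone makes the
# ends of a spanning Fosmid appear closer together than the insert size allows.
pairs = [
    ("f1", 100_000, 140_500),
    ("f2", 120_000, 159_000),
    ("f3", 150_000, 169_000),  # spans 19 kb, about 21 kb shorter than expected
]
discordant = flag_discordant_pairs(pairs, expected_mean=40_000, expected_sd=2_500)
print(discordant)  # [('f3', 19000)]
```

In practice one would look for clusters of discordant pairs over the same interval rather than single outliers, since individual chimaeric or misplaced clones also produce aberrant spans.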

We produced a manually curated catalogue of genes (see Methods), annotating 337 gene loci and 171 pseudogene loci on chromosome 18. These include all previously known genes on chromosome 18 (Table 1). According to the Hawk2 categorization scheme (http://www.sanger.ac.uk/Info/workshops/hawk2, see Supplementary Information) there are 243 ‘known’ genes, 49 ‘novel CDS’ (coding sequence of a gene), 10 ‘novel transcripts’, 11 ‘putative genes’, 11 ‘predictedplus genes’ and 13 ‘gene fragments’. All ‘novel transcript’ genes had expressed-sequence-tag (EST) evidence. For ‘putative genes’, only a subset of the exons were supported by one or more spliced ESTs. Only a small fraction of all loci, those in the ‘novel’ and ‘putative’ categories, were annotated as genes on the basis of spliced EST evidence only. Some ‘gene fragment’ loci may prove to be pseudogenes.

Using aligned EST evidence, it was possible to extend many of the previously known gene models at their 5′ or 3′ ends (see Supplementary Fig. S2 for an example). Approximately 57% of the RefSeq and mammalian gene collection (MGC) transcripts could be extended. The 5′ end extensions averaged 321 bp, and 3′ end extensions averaged 1,131 bp. In addition, a novel 5′ exon was found for 14% of the RefSeq or MGC transcripts, and a novel 3′ exon was found for 2.2%. The ability to extend the gene models probably reflects expanded databases of transcripts and ESTs. A sampling of the extended gene models was validated in the laboratory (see Supplementary Information).

We found an average of 10.7 exons per full-length known transcript, comparable to recent published reports of human chromosomes. Internal exon lengths average 155 bp, and the average transcript length is 3.1 kb for full-length transcripts of known genes. There is evidence of extensive alternative splicing, with gene loci having an average of 3.1 distinct transcripts and 71% having at least two transcripts. This rate of alternative splicing is comparable to recent reports5,6.

The longest gene on chromosome 18 is DCC (deleted in colorectal carcinoma), spanning 1,190,632 bp. DCC also contains the longest intron at 411,177 bp. The longest mature transcript is laminin α3 (LAMA3) at 10,585 bp. The longest single exon is found in TCF4, being a 3′ exon of 5,700 bp. The gene with the most identified splice forms is TGIF (TGFβ-induced factor), which appears to have ten splice forms, of which two are represented by RefSeq transcripts. Of the 171 pseudogenes on chromosome 18, approximately two-thirds are processed (intronless) pseudogenes arising from retroposition, and the remaining one-third are unprocessed. In addition, we identified four transfer RNA genes on the chromosome, listed in Supplementary Table S7. An analysis of gene families revealed that several families have multiple members present on chromosome 18. These include members of the laminin and cadherin families of cell adhesion molecules, and a cluster of ten serpin serine protease inhibitors (see Supplementary Information). Careful analysis of gene models found 59 pairs of overlapping genes on chromosome 18, suggesting that overlapping genes may be 2–4 times more common than previously thought12,13 (see Supplementary Information).

With an average of 4.4 genes per megabase (Mb), chromosome 18 has the lowest gene density of published human chromosomes (Supplementary Table S1). This gene density cannot be explained by chance fluctuation around a genome-wide mean (P < 10⁻¹², see Supplementary Information). The low gene density is reflected both in the low percentage of transcribed sequence (28.5%) and the small fraction of the chromosome included in exons (1.14% in all exons, 1.06% in coding exons). The G + C content (39.8%) is also low, consistent with the known positive correlation between G + C content and gene number14.

Chromosome 18 contains 24 gene deserts (defined as a 500-kb region without a coding gene, Supplementary Table S8), which together comprise 28 Mb or 38% of the total chromosome length. The sparsest region of the chromosome harbours only three genes over 4.5 Mb. Chromosome 18 also has the longest median length of introns among all chromosomes, reflecting a genome-wide inverse correlation between intron size and gene density (Supplementary Fig. S3).
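The gene-desert definition above (a region of at least 500 kb without a coding gene) translates directly into a scan over sorted gene intervals. The following is a minimal sketch under that definition; the function name, the half-open coordinate convention and the toy intervals are assumptions made for illustration and do not reproduce the annotation used here.

```python
def find_gene_deserts(genes, chrom_length, min_size=500_000):
    """Return (start, end) intervals of at least min_size bases with no gene.

    genes: list of (start, end) gene intervals, 0-based and half-open.
    """
    deserts = []
    cursor = 0  # end of the rightmost gene seen so far
    for start, end in sorted(genes):
        if start - cursor >= min_size:
            deserts.append((cursor, start))
        cursor = max(cursor, end)  # tolerate overlapping or nested genes
    if chrom_length - cursor >= min_size:
        deserts.append((cursor, chrom_length))
    return deserts


# Toy chromosome of 3 Mb with four hypothetical genes.
genes = [
    (100_000, 250_000),
    (900_000, 1_000_000),
    (1_200_000, 1_300_000),
    (2_500_000, 2_600_000),
]
deserts = find_gene_deserts(genes, chrom_length=3_000_000)
desert_bases = sum(end - start for start, end in deserts)
print(deserts)       # [(250000, 900000), (1300000, 2500000)]
print(desert_bases)  # 1850000
```

Dividing the summed desert length by the chromosome length gives the fraction of the chromosome in deserts, the quantity reported in the text.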

Despite being gene-poor, chromosome 18 is not enriched in repeat sequences. Transposable element fossils cover 43.5% of the chromosome, which is typical across the genome. Chromosome 18 also has a relatively low proportion of segmental duplication (segmental duplications are defined as having greater than 90% identity and being longer than 1 kb). Segmental duplications constitute 2.5% (1.92 Mb) of the chromosome, with a greater representation of interchromosomal duplications (2.13%) than intrachromosomal duplications (0.55%). Some sequences are represented in both types of duplication (E. Eichler and X. She, personal communication).

The paucity of genes on chromosome 18 probably explains why it is one of only three autosomes (the others being chromosomes 13 and 21) for which trisomic individuals routinely survive to term1 (www.trisomy.org, www.ndss.org). Although chromosomes 18 and 21 have roughly the same number of RefSeq genes (332 and 374 genes, respectively), chromosome 18 trisomy (Edwards syndrome) has much more severe health effects than chromosome 21 trisomy (Down syndrome). Edwards syndrome occurs in 1 in 5,000 live births, and 90% of affected individuals die before one year of age. In contrast, Down syndrome is more common (1 in 800 live births), and affected individuals are frequently able to cope with the numerous health consequences and survive to adulthood. The availability of gene catalogues for these two chromosomes will facilitate work to elucidate how the contributions of specific genes lead to such different clinical outcomes.

Four other syndromes are caused by gross abnormalities in chromosome 18, including three partial monosomies caused by deletion of part of the p or q arms (18p-, 18q- and ring18) and tetrasomy of the p arm (www.chromosome18.org). The gene catalogue presented here should facilitate identification of the critical genes associated with each syndrome.

At least 45 loci on chromosome 18 have been implicated in genetic disorders15 (Supplementary Table S9). The list includes at least four disorders for which the responsible gene and molecular mechanism of disease have been characterized (Supplementary Table S9). For two such diseases (methemoglobinaemia and erythropoietic protoporphyria), we found evidence for novel alternative splice forms that would result in coding sequence alterations (not shown).

Comparative gene analysis revealed one locus that may represent a newly evolved gene in the primate lineage, although its function is unknown. Among the annotated multi-exon genes contained in blocks of conserved synteny among mammals, only one lacks exonic conservation with rodents and dog: C18orf2, a predicted RefSeq gene. Within this block of conserved synteny there is a primate-specific 100-kb inversion (present in both human and chimpanzee). One of the endpoints of this inversion lies in the middle of the coding region of the gene, with the result that the region is not contiguous in either dog or rodent genomes. Partial sequencing of this gene in apes suggests that it is conserved at least as far back as orangutan (see Supplementary Information).

We compared chromosome 18 to its homologue chimpanzee chromosome 18 (ref. 16). The average sequence divergence is 1.25%, which is close to the genome-wide average. On a larger scale, the karyotype of human chromosome 18 differs from its homologues in the great apes by a human-specific pericentric inversion with an associated human-specific inverted duplication of 19 kb (refs 17, 18). As a consequence, human 18p corresponds to the proximal region of chimpanzee 18q. As large-scale chromosomal rearrangements can facilitate speciation19,20, it is possible that this inversion has had a role in hominid evolution.

Finally, we sought to explore the still-mysterious nature of conserved non-protein-coding sequences. Recent comparison of the human and mouse genomes4 led to the surprising discovery that 5% of the human genome shows evolutionary conservation higher than the background rate (defined as the rate seen in ancestral repeat elements, which are presumed to be non-functional). Similar results have been seen in comparisons between the human and rat genomes21. As only 1–2% of the human genome encodes protein-coding exons, this indicates that the majority of human sequence under purifying selection is non-protein-coding. In principle, these non-protein-coding sequences could be (1) associated with protein-coding genes, such as those that directly or indirectly regulate the expression of protein-coding genes, or (2) independent of protein-coding genes, such as those that play a structural role in chromosome architecture or those that encode RNA genes.

We calculated the overall proportion of bases on each chromosome that are under purifying selection, and allocated this proportion as either protein-coding or non-protein-coding (see Methods). The computational analysis closely followed that used in recent mammalian comparisons4,22 (see Methods). We compared the proportion of total sequence under selection (Fig. 2a) and non-protein-coding sequence under selection (Fig. 2b) to the proportion of coding sequence for each human chromosome. Chromosome 18 contains a low overall proportion of sequence under selection, but this is almost entirely explained by its low coding density, as there is no deficit in non-protein-coding sequence under selection. Approximately 4.2% of the bases on chromosome 18 appear to be under purifying selection, consisting of 0.6% in exons of protein-coding genes and 3.6% in non-protein-coding elements. The proportion of non-protein-coding sequence under selection is typical for human chromosomes. (Note that chromosomes 19 and 22 are outliers in this analysis; the many local gene family expansions make it difficult to assign orthology.)

Figure 2: Scatter plots showing the fraction of syntenic region under selection plotted against the fraction of coding sequence in that region.

a, By chromosome, the fraction of all sequence under selection versus the coding fraction. b, By chromosome, the fraction of all non-protein-coding sequence under selection versus the coding fraction. Numbers refer to specific chromosomes. c, The fraction of all sequence under selection within the region versus the coding fraction within the region. d, The fraction of all non-protein-coding sequence under selection versus the coding fraction. In c and d, each point represents a 5-Mb region from a set of non-overlapping 5-Mb regions covering the genome. Lines of regression are shown. We define non-protein-coding sequence as that which is completely disjoint from any predicted mature mRNA product of an annotated protein-coding gene.

As chromosomes vary widely in size, we repeated the analysis for 5-Mb windows across the human genome (Fig. 2c, d). Although there is more scatter in the data, the overall conclusion is very similar. Notably, the average proportion of non-protein-coding selected sequence in a window is 3.8%, and is slightly negatively correlated (R² = 0.08) with the proportion of coding sequence in the window.
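The regression summarized above (fraction of non-protein-coding sequence under selection against fraction of coding sequence, one point per window) can be sketched with an ordinary least-squares fit. This is a generic illustration, assuming simple linear regression; the function name and the four toy data points are hypothetical and are not taken from Fig. 2.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Toy windows (hypothetical): coding fraction and non-coding selected fraction.
coding_fraction = [0.005, 0.010, 0.015, 0.020]
noncoding_selected = [0.040, 0.037, 0.039, 0.036]
slope, intercept, r_squared = linear_fit(coding_fraction, noncoding_selected)
print(round(slope, 3), round(r_squared, 3))  # -0.2 0.5
```

A slope near zero with a small R² is the pattern that would indicate that the selected non-protein-coding fraction does not scale with coding density.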

Our analysis shows that the density of conserved non-protein-coding sequences is largely independent of the density of protein-coding genes. It is interesting to note that examination of non-coding aligned sequences between human and chicken23 showed a negative correlation with coding content, and a study of highly conserved non-coding sequences in intergenic regions of human chromosome 21 did not identify tight coupling to the starts and ends of genes24,25.

What is the nature of the non-protein-coding elements? First, the elements might encode transcripts that are not translated into proteins, such as small RNA genes or large regulatory RNAs26. Second, they might serve a structural role, with a constant density of such elements required to maintain chromosome structure independent of gene density. Such structural elements could be evolutionarily essential for maintenance of a region, but might be dispensable if the entire region were to be deleted; this might explain the recent observation in mouse that a 1-Mb deletion in a gene desert containing highly conserved elements has no discernible phenotypic effect27. Third, the elements may be largely related to the regulation of protein-coding genes, but their distribution may be inversely correlated with gene density28,29. It is possible that genes in gene-poor regions tend to have more elaborate regulatory controls, and this could partially explain the relative sparsity of genes in such regions. In any case, it is clear that the finished sequence of the human genome will reveal many features of biological function and provide a firm foundation for future systematic analyses.

Methods

Generation of the gene catalogue

We started by aligning all available human RefSeq, MGC and GenBank messenger RNA sequences, as well as GenPept sequences from several species, to the finished sequence. Gene models were inspected manually to ensure accurate transcriptional start and stop sites, and to correct splice sites. Non-canonical splice sites were used only if supported by sufficient complementary DNA-based evidence. Partial transcripts (those containing a partial open reading frame (ORF) or overlapping non-coding exons of sibling transcripts) were annotated in cases for which there was firm evidence of their existence. Gene symbols for biologically characterized loci were assigned by the HUGO Gene Nomenclature Committee. See Supplementary Table S10 for a complete list of gene symbols. Our annotations are available from the Vertebrate Genome Annotation database (VEGA, http://vega.sanger.ac.uk/Homo_sapiens).

Comparative analysis: creation of synteny maps

We performed full genomic alignments of repeat masked sequence from mouse4 (builds 31 and 33), rat21 and dog (CanFam 1.0; K. Lindblad-Toh, personal communication) with the human genome sequence using the PatternHunter program30. We did this for human build 34 with the Broad finished chromosomes (8, 15, 17, 18) inserted, and also for human build 35 (mouse build 31 was used against human build 34, and mouse build 33 against human build 35). From these alignments we identified collinear clusters of conserved microsynteny, which were then used to form larger syntenic segments in a hierarchical fashion. Syntenic maps and their underlying syntenic anchors serve as the basis for identification of conserved elements.

Comparative analysis: identification of conserved elements

Starting with large-scale syntenic blocks defined by the human–mouse and human–dog syntenic maps, we generated pair-wise alignments within these syntenic blocks using the PatternHunter program30. We then scanned 50-bp windows with 5-bp offset and calculated the fraction of aligning bases that were matches (discarding windows with fewer than 20 aligning bases). These percentage conservation values were locally normalized to the average conservation in the surrounding 5 Mb to generate Z-scores measuring divergence from the local average (0) for every window. We examined the joint empirical distribution of mouse and dog Z-scores for windows contained within ancestral repeat sequence (undergoing neutral evolution and believed to predate the mouse–human split) and windows overlapping coding exons (Supplementary Fig. S4a). Coding sequence is defined as all bases that are annotated as coding in any transcript. All analysis presented uses Ensembl31 genes on human build 35; analysis with both Ensembl and Broad annotations on build 34 yields substantially similar results (Supplementary Information).
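The window scan described above can be sketched as follows. The sketch assumes that a pairwise alignment has been reduced to one state per human base (match, mismatch or unaligned) and that the normalization is a conventional Z-score, (value - local mean) / local standard deviation; the text does not spell out the exact formula, so that choice is an assumption. In this toy version the "surrounding 5 Mb" is replaced by all windows in the input, and all names and data are hypothetical.

```python
from statistics import mean, pstdev


def window_conservation(states, size=50, step=5, min_aligned=20):
    """Fraction of aligned bases that match, in sliding windows.

    states: per-base values, 1 = aligned match, 0 = aligned mismatch,
            None = unaligned. Returns a list of (window_start, fraction).
    Windows with fewer than min_aligned aligned bases are discarded.
    """
    windows = []
    for start in range(0, len(states) - size + 1, step):
        aligned = [s for s in states[start:start + size] if s is not None]
        if len(aligned) < min_aligned:
            continue
        windows.append((start, sum(aligned) / len(aligned)))
    return windows


def local_z_scores(windows):
    """Normalize window conservation to the mean and s.d. of the given windows."""
    values = [value for _, value in windows]
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [(start, 0.0) for start, _ in windows]
    return [(start, (value - mu) / sd) for start, value in windows]


# Toy alignment: 50 matching bases, 100 bases alternating match/mismatch,
# then 100 unaligned bases.
states = [1] * 50 + [1, 0] * 50 + [None] * 100
windows = window_conservation(states)
z_scores = local_z_scores(windows)
print(len(windows), windows[0])  # 27 (0, 1.0)
```

Windows that fall mostly in the unaligned stretch are dropped by the 20-base threshold, which is why only 27 of the 41 possible window positions are retained in this example.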

We combined dog and mouse Z-scores to generate a ‘composite’ Z-score (see Supplementary Information). We estimated the distribution of composite Z-scores for selected sequence by decomposing the global distribution of Z-scores into two components: a ‘neutral distribution’ centred at zero and corresponding to the conservation scores for ancestral repeat sequences, and a ‘selected distribution’ consisting of the residual after subtraction of the neutral distribution (Supplementary Fig. S4b). Taking into account the relative fractions of the aligning windows in each distribution, we were able to assign a probability that a window at a given score is under purifying selection.
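The decomposition described above can be illustrated on binned scores. The sketch below assumes that the global and neutral (ancestral repeat) score distributions are available as histograms over the same bins and that the share of windows evolving neutrally is supplied as a parameter; how that share is estimated is not specified here, so `neutral_fraction` and the toy counts are assumptions for illustration only.

```python
def selection_probability(global_counts, neutral_counts, neutral_fraction):
    """Per-bin probability that a window with that score is under selection.

    global_counts, neutral_counts: histograms over identical score bins.
    neutral_fraction: assumed share of all windows that evolve neutrally.
    The 'selected' component is the residual of the global distribution
    after subtracting the scaled neutral distribution.
    """
    global_total = sum(global_counts)
    neutral_total = sum(neutral_counts)
    probabilities = []
    for g, n in zip(global_counts, neutral_counts):
        if g == 0:
            probabilities.append(0.0)
            continue
        global_density = g / global_total
        neutral_density = neutral_fraction * n / neutral_total
        residual = max(0.0, global_density - neutral_density)
        probabilities.append(residual / global_density)
    return probabilities


# Toy histograms over five score bins, from low to high conservation.
neutral_counts = [10, 40, 40, 10, 0]
global_counts = [9, 36, 36, 12, 7]
probs = selection_probability(global_counts, neutral_counts, neutral_fraction=0.9)
print([round(p, 2) for p in probs])  # [0.0, 0.0, 0.0, 0.25, 1.0]
```

In this toy case only the two highest-scoring bins carry any excess over the neutral expectation, so only windows in those bins receive a non-zero probability of selection.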

We then divided the genome into non-overlapping 5-Mb windows. Within each such window, we counted the number of syntenic bases, the number of syntenic 50-bp windows, and the number of 50-bp windows under selection. The fraction of coding sequence (the explanatory variable in all regressions) was taken as the number of syntenic bases annotated as coding divided by the number of syntenic bases. The fraction under selection was calculated as the sum of all selection probabilities for all windows divided by the number of syntenic windows. If windows of only a certain class were considered, the probabilities were calculated only for windows in that class. We note that, on average, windows contained within coding exons scored only slightly higher than 0.67 probability of selection, owing to the large prior probability of neutrality. Thus, the slopes of all regressions are <1. For all analyses, we discarded any 5-Mb window with less than 4 Mb of syntenically assigned sequence (retaining >85% of all windows of non-zero euchromatic length). Similar results are obtained if the discarded windows are included, but the variance is higher.
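The per-window summary described above reduces to two ratios and a filter. The following minimal sketch assumes that the syntenic base count, the coding base count and the per-window selection probabilities for one 5-Mb window have already been computed; the function name and the toy numbers are hypothetical.

```python
def summarize_window(syntenic_bases, coding_bases, selection_probs,
                     min_syntenic=4_000_000):
    """Summarize one 5-Mb window.

    Returns (coding_fraction, fraction_under_selection), or None when the
    window has too little syntenically assigned sequence to be retained.
    selection_probs: one selection probability per syntenic 50-bp window.
    """
    if syntenic_bases < min_syntenic or not selection_probs:
        return None
    coding_fraction = coding_bases / syntenic_bases
    fraction_under_selection = sum(selection_probs) / len(selection_probs)
    return coding_fraction, fraction_under_selection


# Toy window: 4.5 Mb syntenic, 45 kb coding, four scored 50-bp windows.
summary = summarize_window(4_500_000, 45_000, [0.0, 0.1, 0.9, 0.2])
skipped = summarize_window(3_000_000, 30_000, [0.5, 0.5])
print(summary is not None, skipped)  # True None
```

Because the fraction under selection is an average of probabilities rather than a count of windows called as selected, windows with intermediate scores contribute fractionally, which is consistent with the regression slopes being below 1.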

Annotation

RefSeq (release 1), mammalian gene collection (MGC, 3 February 2003), dbEST and GenBank (29 December 2002) mRNAs were aligned to the genomic assembly using BLAT32. GenPept protein sequences (3 February 2003) were aligned using BLASTX33 and GeneWise34. All gene models were created manually using these aligned sequences as evidence, following HAWK2 (www.sanger.ac.uk/Info/workshops/hawk2) transcript type conventions. Gene models derived from aligned mRNA evidence were extended when possible using spliced EST evidence at the 5′ end and spliced and unspliced EST evidence in the 3′ untranslated region (UTR). Evidence was given relative priority as follows (high–low): RefSeq/MGC, GeneWise, other mRNAs, spliced ESTs and unspliced ESTs. We found CpG islands within 2-kb upstream and 1-kb downstream of the 5′ end of 73% of known category loci, which is somewhat higher than previous reports (in the range of 61–66%; cited in ref. 3, also refs 5–7).