Dogs and mice, predator and prey, are now conjoined twins in the history of mammal genomics. Together, their sequences, complete for the mouse and now 80% complete for the dog, are the paired standards for comparison to the complete human genome sequence. Analysis of this trinity is beginning to yield surprising results. We review recent progress in the genome sequencing of the dog and discuss new discoveries based on its comparison to mouse and human. We also survey current efforts to complete the dog genome and the promise that this holds for comparative genomics and evolutionary studies.

The nearly complete dog genome sequence

The first step towards a better understanding of the dog genome was the development of a physical map (Guyon et al, 2003). The past few years have witnessed remarkable progress in this regard and a 3270 marker map with 900 kb resolution is currently available. However, the large number of dog chromosomes (2N=78) presents problems for comparative mapping with the human, mouse and rat, the only mammals whose genomes have been completely sequenced. There was therefore general rejoicing in the dog community with the publication of a first-generation dog sequence in the September 26th issue of Science by researchers at The Institute for Genomic Research (TIGR) and The Center for the Advancement of Genomics (Kirkness et al, 2003). The sequence provided roughly 1.5 × coverage, meaning that about 80% of the genome is present in the resulting sequence reads. This is not quite enough to completely assemble a dog genome. In fact, the average size of a contig (a set of joined overlapping fragments) was only about 1400 bases, and contigs were further assembled into scaffolds with a mean length of 3.8 kb. However, as we discuss below, a remarkable finding of this report was that nearly as much was learned about the dog genome from this 1.5 × sequence as was gained from the 8 × sequence of the mouse. This suggests that, considering the reduced expense (40 million dollars per mammalian genome to generate a 6–7 × sequence vs 5–6 million dollars for a 1 × sequence), efforts to generate a 1 × sequence from a great variety of species can now be justified. In fact, over the next 4 years, US sequencing centers could produce 460 billion bases, enough for 192 genomes at 1 × coverage (Couzin, 2003).
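The relationship between sequencing depth and the fraction of the genome sampled can be illustrated with a short calculation. The sketch below is not taken from Kirkness et al; it assumes that reads land uniformly at random, so that the expected fraction of bases hit by at least one read at mean depth c is 1 - e^(-c) (the standard Poisson approximation). The function names are hypothetical, and the 2.4 billion base genome size used for the throughput check is the dog estimate quoted in this review, adopted here as an assumption about how the figure of 192 genomes was reached.

```python
import math


def expected_fraction_covered(depth: float) -> float:
    """Expected fraction of bases hit by at least one read.

    Assumes reads fall uniformly at random (Poisson model); real
    libraries have cloning and sequence biases, so this is optimistic.
    """
    return 1.0 - math.exp(-depth)


def genomes_at_depth(total_bases: float, genome_size: float, depth: float = 1.0) -> float:
    """Number of genomes of a given size that total_bases can cover at a given depth."""
    return total_bases / (genome_size * depth)


# Depths mentioned in the text: 1x surveys, the 1.5x dog, 6x and 8x projects.
for depth in (1.0, 1.5, 6.0, 8.0):
    print(f"{depth:.1f}x -> {expected_fraction_covered(depth):.1%} of bases expected in reads")

# Assumed genome size of 2.4e9 bases (the dog estimate); 460e9 bases of capacity.
print(round(genomes_at_depth(460e9, 2.4e9)))
```

Under these assumptions a 1.5 × sequence is expected to sample a little under 78% of bases, in line with the figure of about 80% quoted above, and 460 billion bases corresponds to roughly 192 genomes of 2.4 billion bases at 1 × coverage.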

The repetitive dog

The estimated size of the dog genome, at about 2.4 billion bases, is similar to that of the mouse but smaller than the 2.9 billion bases in the human genome. Of this total sequence, 31% is repetitive DNA, as opposed to 46% in the human and 38% in the mouse. The most common repeat in dogs is a short interspersed nucleotide element (SINE) derived from a tRNA-Lys sequence, with about 7% of the genome consisting of this single repeat class. Further, a subfamily of this repeat comprises 23 000 copies in the dog and has only 4.8% average divergence from the consensus sequence, suggesting a recent extensive expansion in the canine lineage.

In addition, a family of long interspersed nucleotide element (LINE) repeats called L1MA, which is common in mammals, has specific subfamilies in dogs that are not found in humans and mice. Similarly, dogs are missing L1MA subfamilies shared by the latter two species, reinforcing the observation based on DNA sequencing that the dog lineage diverged before human and mouse split from each other (Murphy et al, 2001). In fact, Kirkness et al used repeat information directly to reconstruct phylogenetic history and showed that the dog diverged first. Further, the authors compared repeat sequences represented by greater than 0.75 Mb in the mouse, human and dog, and confirmed prior reports of a more rapid rate of substitution in the mouse than in the human (1.6-fold). In contrast, the dog and human had similar rates of sequence substitution.

A surprising finding is that some SINE repeats are bimorphic in the sequenced dog (one allele carries the SINE repeat and the other does not). In all, 7% of one class of SINE repeats are bimorphic in the dog, which is an order of magnitude more than that observed for human SINE repeats (>16 000 vs 1200 copies, respectively). Such polymorphic loci may be useful in reconstructing dog history, in population genetic studies of wild canids and in gene mapping. As some of these insertions lie within genes, where they can cause severe phenotypic effects such as canine narcolepsy (Lin et al, 1999), Kirkness et al suggest that they may be an underlying cause of the large phenotypic diversity in dogs.

Pile-ups, gaps and COBs

Despite the high density of the mouse sequence and the low density of the dog sequence, most of the genes shared by human and mouse are recovered when either species is compared with the dog sequence. For example, 96% of the genes shared between mouse and human are recovered in comparison to the dog, and in total, the dog sequence recovers slightly more human genes (18 473) than the mouse (18 311). The genes that did not align are likely to represent ancestral genes lost in the lineage leading to human and mouse. Critically, the genes that are shared between dog and human are much more alike; on average, they show nearly twice the sequence similarity observed between human and mouse genes. As Kirkness et al suggest, this likely reflects the more rapid rate of sequence evolution in the mouse compared to human and dog.

Multiple copies of some dog genes align to a single human gene. These are termed ‘pile-ups’ by Kirkness et al and reflect recent amplification events in the dog. In contrast, other human genes appear to have no homolog in the dog 1.5 × sequence and are called ‘gaps’. Comparative sequence analysis classified 1355 genes as pile-ups and 513 as gaps, and these include some interesting genes. For example, similar to the mouse, the dog appears to have a larger complement of olfactory receptors than humans, reinforcing notions about our limited olfactory capabilities (Gilad et al, 2003). In contrast, the mouse has 140 vomeronasal receptors whereas the dog has only four. Some regions of the human genome containing 10 genes or more seemed to be missing in the dog, including clustered gene families such as the pregnancy-specific beta-1 glycoproteins and defensins. These absences may of course reflect cloning artifacts in the material used for generating the 1.5 × sequence, and more sequencing will be needed to determine whether various genes are really lost in the dog genome. If so, further analysis of lost and amplified genes in the dog may provide valuable insight into possible functional differences among the three species.
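The distinction between pile-ups and gaps can be made concrete with a toy classification. The sketch below follows only the verbal description given here (several dog copies aligning to one human gene, or none at all); it is not the procedure used by Kirkness et al, whose criteria are not reproduced in this review. The function name, the ‘single copy’ label, the gene identifiers and the counts are all hypothetical and chosen purely for illustration.

```python
from collections import Counter


def classify_by_copy_number(dog_copies_per_human_gene):
    """Label each human gene by how many dog sequences aligned to it.

    Illustrative rule only: 0 copies -> 'gap', 1 -> 'single copy',
    2 or more -> 'pile-up'. A real analysis would need statistical
    thresholds that account for uneven read depth.
    """
    labels = {}
    for gene, copies in dog_copies_per_human_gene.items():
        if copies == 0:
            labels[gene] = "gap"
        elif copies == 1:
            labels[gene] = "single copy"
        else:
            labels[gene] = "pile-up"
    return labels


# Toy input: made-up gene identifiers and counts, not data from the paper.
toy_counts = {"geneA": 0, "geneB": 1, "geneC": 4, "geneD": 1}
labels = classify_by_copy_number(toy_counts)
summary = Counter(labels.values())
print(summary)
```

At 1.5 × coverage a count of zero does not by itself demonstrate gene loss, since the gene may simply fall in unsampled or poorly cloned sequence, which is why a rule this simple can only flag candidates for follow-up.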

Another interesting class of homologs is identified by three-way alignments of sequences, which define clusters of orthologous bases (COBs). As with the general comparison of genes in the three species, the dog shows more similarity to humans than to the mouse. Dog–mouse–human comparisons revealed a total of 371 774 COBs with an average length of 456 bp, or about 7% of the dog genome. Although COBs are more common in coding regions of genes, about 45% of COBs are found in intergenic regions. Untranslated regions (UTRs) are enriched in COBs relative to introns and to regions upstream or downstream of genes, suggesting that COBs in UTRs might sometimes be regulatory in nature.
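The figure of about 7% follows directly from the numbers quoted above, as the short check below shows. It assumes the genome size of 2.4 billion bases given earlier in this review, and the variable names are hypothetical. The intergenic count is only a rough estimate, because the 45% share is itself an approximate value.

```python
# Values quoted in the text.
n_cobs = 371_774          # number of clusters of orthologous bases
mean_cob_length_bp = 456  # average COB length in base pairs
dog_genome_bp = 2.4e9     # assumed dog genome size (about 2.4 billion bases)
intergenic_share = 0.45   # approximate share of COBs in intergenic regions

# Total bases contained in COBs and their share of the genome.
total_cob_bp = n_cobs * mean_cob_length_bp
fraction_of_genome = total_cob_bp / dog_genome_bp

# Rough number of intergenic COBs implied by the approximate 45% share.
approx_intergenic_cobs = round(n_cobs * intergenic_share)

print(f"Total COB sequence: {total_cob_bp / 1e6:.1f} Mb")
print(f"Fraction of dog genome: {fraction_of_genome:.1%}")
print(f"Approximate intergenic COBs: {approx_intergenic_cobs}")
```

The COBs therefore amount to roughly 170 Mb of aligned sequence, of which on the order of 167 000 clusters would lie outside genes.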

The syntenic dog

Early investigations using reciprocal chromosome painting showed that dogs and humans share large syntenic blocks of DNA, and as many as 65–70 blocks were initially detected (Breen et al, 1999; Yang et al, 1999; Sargan et al, 2000). Radiation hybrid mapping expanded that number to over 85 (Guyon et al, 2003). Further, recently published work suggests that small rearrangements, deletions and insertions exist in the dog as they do in the mouse, although probably not to as large an extent (Guyon et al, in press). The new dog genome sequence confirmed 78 of 85 conserved blocks and, in total, identified 159 segments of conserved synteny between dog and human that collectively span 2.2 billion base pairs. Consequently, as a system for studying the comparative evolution of mammalian chromosomes, the dog rivals the mouse as a simpler system with fewer breaks and rearrangements.

The future dog

A proposal for a 6 × sequence of the dog was approved by the NHGRI last year, and the Whitehead Institute at MIT won support for the sequencing work with an expected completion date of June 2004. This will provide a remarkable resource for comparison with the 1.5 × dog sequence as well as the human and mouse. For example, comparison of the 1.5 × sequence with humans revealed nearly a million single-nucleotide polymorphisms (SNPs) and about 150 000 putative di-, tri- and tetranucleotide repeat polymorphisms. With the June completion of the 6 × sequence from a boxer and the now completed 1.5 × sequence from a standard poodle, a wealth of polymorphic markers will be discovered for use in association mapping studies. The dog may soon be the only species with two essentially complete genomes for comparison. Further, the Whitehead Institute has proposed sequencing an additional 600 million base pairs from nine dog breeds and a gray wolf. This extensive comparative database will provide novel genetic markers for population genetic and evolutionary studies of domestic and wild canids, and may lead to a new understanding of the genes associated with domestication. To understand gene function and evolution, a focus on individual variation is the necessary next step in genome sequencing. The extensive phenotypic variation among breeds and the widespread abundance of its progenitor, the gray wolf, uniquely position the dog as pack dominant in the race to find genes of consequential effect on morphology and behavior.