Comparative genomics: genome-wide analysis in metazoan eukaryotes

Key Points

  • With the availability of genome sequences for an increasing number of metazoan organisms, the data are now present to carry out large-scale comparative genomic studies.

  • The size of sequenced metazoan genomes, which range from 100 Mb to 3 Gb, makes their comparison a real challenge. Several new approaches have been developed to solve some of the associated problems.

  • The alignment of whole genomes extends the genome regions that are available to analyse evolutionary mechanisms such as neutral, negative and positive selection, and the history of large insertions and deletions.

  • The potential to compare closely, and even less closely, evolutionarily related metazoans provides new opportunities to identify conserved functional sequences, such as genes or regulatory regions, that are not easily predictable by conventional approaches on a single genome.

  • Hidden Markov model-based programs have been developed mainly in the field of gene prediction to make the most of genome-comparison alignments.

  • The ab initio identification of regulatory regions on a single genome often gives sensitive, but not highly specific, results. Comparative genomic data allow a significant increase in the specificity of such processes.

Abstract

The increasing number of complete and nearly complete metazoan genome sequences provides a significant amount of material for large-scale comparative genomic analysis. Finding new effective methods to analyse such enormous datasets has been the object of intense research. Three main areas in comparative genomics have recently shown important developments: whole-genome alignment, gene prediction and regulatory-region prediction. Each of these areas improves the methods of deciphering long genomic sequences and uncovering what lies hidden in them.

Figure 1: Evolutionary relationship between metazoans that are sequenced or due for sequencing.
Figure 2: Whole-genome alignments available online.
Figure 3: Schematic description of whole-genome alignment processes.

Acknowledgements

We thank M. Brudno, W. Miller and L. Pachter for providing their respective manuscripts before publication and W. J. Kent, W. Miller, M. Brudno and L. Bentolila for helpful discussion and comments on the manuscript. We also thank the anonymous reviewers for many helpful suggestions. A.U.-V. is funded by the Wellcome Trust. E.B. and L.E. are funded by the European Molecular Biology Laboratory.

Author information

Corresponding author

Correspondence to Ewan Birney.

Related links

DATABASES

LocusLink

APOAV

even skipped

FURTHER INFORMATION

Berkeley Genome Pipeline

Comparative analysis of the rat genome

ECR Browser

Ensembl Genome Browser

Penn State Bioinformatics Group

UCSC Genome Browser

WormBase

Glossary

RATE MATRIX

Denotes the probability of mutation from one amino acid to another (or from one nucleotide to another) for a given period of evolution. The best-known rate matrices are BLOSUM and PAM.
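
As a rough illustration of the "given period of evolution" in this definition, the sketch below builds a toy nucleotide matrix for one evolutionary step and multiplies it by itself to cover a longer period, in the same spirit as deriving PAM250 from PAM1. The uniform per-step change probability of 0.01 and the names `one_step`, `mat_mul` and `mat_pow` are illustrative assumptions, not values or code from the article.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, steps):
    """Return m multiplied by itself `steps` times (identity for 0 steps)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, m)
    return result


NUCLEOTIDES = "ACGT"
P_CHANGE = 0.01  # assumed probability of changing to each other base per step

# Row i, column j: probability that base i is replaced by base j in one step.
one_step = [[1 - 3 * P_CHANGE if i == j else P_CHANGE for j in range(4)]
            for i in range(4)]

# The same matrix for a period 100 times longer.
hundred_steps = mat_pow(one_step, 100)

for base, row in zip(NUCLEOTIDES, hundred_steps):
    print(base, [round(p, 4) for p in row])
```

Each row still sums to 1 after the multiplication, and the probability of finding the original base drops towards 0.25 as the period grows, which is why a matrix is only meaningful for the evolutionary distance it was built for.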

FUNCTIONAL SEQUENCE

A genomic sequence that provides a function that is under selection and tends to be conserved between species. For example, a protein-coding region or transcription-factor binding site.

SEEDS

A short exact, or nearly exact, matching string of characters aligning between two sequences.
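
A minimal sketch of exact seed finding, assuming a fixed word length `k` and a simple word index: it lists every position pair at which the two sequences share the same k-letter word. The function name `find_seeds` and the example sequences are hypothetical; the programs discussed in the article use more elaborate (for example, spaced or nearly exact) seeds.

```python
from collections import defaultdict


def find_seeds(seq_a, seq_b, k=4):
    """Return sorted (pos_a, pos_b) pairs where k-letter words match exactly."""
    # Index every k-letter word of seq_a by its start position.
    index = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        index[seq_a[i:i + k]].append(i)

    # Look up every k-letter word of seq_b in that index.
    seeds = []
    for j in range(len(seq_b) - k + 1):
        for i in index.get(seq_b[j:j + k], []):
            seeds.append((i, j))
    return sorted(seeds)


seeds = find_seeds("ACGTACGGT", "TTACGGA", k=4)
print(seeds)  # [(3, 1), (4, 2)]
```

The two seeds here lie on the same diagonal (pos_a minus pos_b equals 2), which is the kind of signal an aligner can extend into a longer alignment.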

PARALOGUES

Sequences, or genes, that have originated from a common ancestral sequence, or gene, by a duplication event.

ORTHOLOGUES

Sequences, or genes, that have originated from a common ancestral sequence, or gene, by a speciation event.

SYNTENIC REGION

A genomic region in which the order of genes (or of other DNA sequences) is collinear between the chromosomes of two species.

SYNTENIC ANCHORS

Short aligned segments between genome sequences from two species, which are believed to define an orthologous relationship.

DOT PLOT MATRIX

A visualization technique that allows the easy identification of matching nucleotides or amino acids (letters) between two sequences. For example, for two sequences X and Y, each letter has a unique coordinate on the x axis and the y axis respectively. When two letters are the same at a specified coordinate, a dot is plotted in the matrix at that position.
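
The description above can be sketched in a few lines. This is a minimal text rendering, assuming sequence X runs along the columns and sequence Y along the rows; the function names and the example sequences are illustrative only.

```python
def dot_plot(seq_x, seq_y):
    """Return rows (one per letter of seq_y) of booleans marking identical letters."""
    return [[x == y for x in seq_x] for y in seq_y]


def render(matrix):
    """Draw the matrix as text, with '*' for a dot and '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row)
                     for row in matrix)


plot = render(dot_plot("GATTACA", "GATCA"))
print(plot)
```

Runs of dots along a diagonal correspond to stretches where the two sequences match letter for letter.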

HIDDEN MARKOV MODEL

(HMM). A probabilistic model that is applied to protein- and DNA-sequence pattern recognition. HMMs represent a system as a set of discrete states and as transitions between those states. Each transition has an associated probability. HMMs are valuable because they enable a search or alignment algorithm to be built on firm probabilistic bases, and the parameters (transition probabilities) can be easily trained on a known data set.
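
To illustrate how transition probabilities might be trained on a known data set, the sketch below counts state-to-state transitions in labelled state paths and normalizes the counts. The two-state labelling ("N" for non-coding, "C" for coding), the training strings and the function name are hypothetical, and no smoothing is applied, so unseen transitions simply get no entry.

```python
from collections import Counter, defaultdict


def estimate_transitions(state_paths):
    """Estimate transition probabilities by counting adjacent state pairs."""
    counts = defaultdict(Counter)
    for path in state_paths:
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    # Normalize each state's outgoing counts so that they sum to 1.
    return {
        prev: {nxt: n / sum(outgoing.values()) for nxt, n in outgoing.items()}
        for prev, outgoing in counts.items()
    }


# Hypothetical training data: one state label per sequence position.
paths = ["NNNCCCNN", "NCCCCNNN"]
probs = estimate_transitions(paths)
print(probs)
```

With these two paths, each state has seven outgoing transitions in total, five of which stay in the same state, so both self-transition probabilities come out as 5/7.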

DISCRIMINANT FUNCTIONS

Classical statistical pattern-recognition methods that are used to categorize samples into two classes of data.

NEURAL NETWORKS

Mathematical models inspired by analogy with biological neurons to distinguish two or more classes of data.

SCORE MAXIMIZATION PROCESS

Many algorithms attempt to find the solution, under a scoring scheme, that is believed to best reflect reality. For 'simple' models, including hidden Markov models, exact mathematical procedures exist that are guaranteed to find the highest-scoring solution.
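
For hidden Markov models, one standard procedure of this kind is the Viterbi algorithm, which finds the single highest-scoring state path by dynamic programming. The sketch below is a minimal log-space version; the two states ("AT_rich", "GC_rich") and all probabilities are made-up illustrative values, not parameters from any program cited in the article.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_path, best_log_score) for an HMM with non-zero probabilities."""
    # Scores for the first observation.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back = [{}]
    for obs in observations[1:]:
        prev = scores[-1]
        column, pointers = {}, {}
        for s in states:
            # Best predecessor state for ending in state s at this position.
            best_prev = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            column[s] = (prev[best_prev] + math.log(trans_p[best_prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = best_prev
        scores.append(column)
        back.append(pointers)

    # Trace back from the best final state.
    last = max(states, key=lambda s: scores[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[-1][last]


STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {"AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
         "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9}}
EMIT = {"AT_rich": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "GC_rich": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}

best_path, best_score = viterbi("ATATGCGCGC", STATES, START, TRANS, EMIT)
print(best_path, best_score)
```

The guarantee holds only relative to the scoring scheme: the returned path is optimal for these parameters, which says nothing about whether the parameters themselves reflect biology.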

NEUTRAL DRIFT

The process by which a DNA sequence acquires many mutations over time that have no phenotypic effect, and are not acted on by Darwinian selection.

STABILIZING SELECTION

Selection that favours intermediate phenotypes over extreme phenotypes.

GAP PENALTY

Alignment programs deal with insertions and deletions (indels) by introducing a 'gap' in the sequence that contains the deletion. The introduction of gaps and their extension decreases the overall alignment score by a certain value. This value is defined by a gap-opening penalty and a gap-extension penalty, both of which are used as parameters in alignment programs.
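
A small worked example of how the two penalties combine, assuming one common convention in which the first gap position costs the gap-opening penalty and each further position in the same gap costs the gap-extension penalty. The scoring values and the function name are illustrative assumptions; real alignment programs differ in their defaults and in the exact convention.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1,
                    gap_open=-5, gap_extend=-1):
    """Score an existing pairwise alignment in which '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_row = None  # which row the current run of gap characters is in
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain gaps in both rows")
        if a == "-" or b == "-":
            row = "a" if a == "-" else "b"
            # Continuing a gap in the same row is cheaper than opening one.
            score += gap_extend if gap_row == row else gap_open
            gap_row = row
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


one_long_gap = alignment_score("ACGTTTGA", "ACG---GA")
three_short_gaps = alignment_score("ACGTTTGA", "A-G-T-GA")
print(one_long_gap, three_short_gaps)  # -2 -10
```

Both alignments contain three gap characters and five matches, but the single three-letter gap scores -2 whereas three separate one-letter gaps score -10, so this scheme favours a few long indels over many short ones.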

About this article

Cite this article

Ureta-Vidal, A., Ettwiller, L. & Birney, E. Comparative genomics: genome-wide analysis in metazoan eukaryotes. Nat Rev Genet 4, 251–262 (2003). https://doi.org/10.1038/nrg1043

  • DOI: https://doi.org/10.1038/nrg1043
