Aligning Multiple Genomic Sequences With the Threaded Blockset Aligner

  1. Mathieu Blanchette (1,6)
  2. W. James Kent (2)
  3. Cathy Riemer (3)
  4. Laura Elnitski (3)
  5. Arian F.A. Smit (4)
  6. Krishna M. Roskin (2)
  7. Robert Baertsch (2)
  8. Kate Rosenbloom (2)
  9. Hiram Clawson (2)
  10. Eric D. Green (5)
  11. David Haussler (1,2)
  12. Webb Miller (3,7)

  1. Howard Hughes Medical Institute, University of California at Santa Cruz, Santa Cruz, California 95064, USA
  2. Center for Biomolecular Science and Engineering, University of California at Santa Cruz, Santa Cruz, California 95064, USA
  3. Center for Comparative Genomics and Bioinformatics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
  4. Institute for Systems Biology, Seattle, Washington 98103, USA
  5. Genome Technology Branch and NIH Intramural Sequencing Center, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA

Abstract

We define a “threaded blockset,” which is a novel generalization of the classic notion of a multiple alignment. A new computer program called TBA (for “threaded blockset aligner”) builds a threaded blockset under the assumption that all matching segments occur in the same order and orientation in the given sequences; inversions and duplications are not addressed. TBA is designed to be appropriate for aligning many, but by no means all, megabase-sized regions of multiple mammalian genomes. The output of TBA can be projected onto any genome chosen as a reference, thus guaranteeing that different projections present consistent predictions of which genomic positions are orthologous. This capability is illustrated using a new visualization tool to view TBA-generated alignments of vertebrate Hox clusters from both the mammalian and fish perspectives. Experimental evaluation of alignment quality, using a program that simulates evolutionary change in genomic sequences, indicates that TBA is more accurate than earlier programs. To perform the dynamic-programming alignment step, TBA runs a stand-alone program called MULTIZ, which can be used to align highly rearranged or incompletely sequenced genomes. We describe our use of MULTIZ to produce the whole-genome multiple alignments at the Santa Cruz Genome Browser.
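The projection idea in the abstract can be illustrated with a toy sketch. This is not TBA's implementation or its file format: the `Block` representation, the function names `project` and `aligned_pairs`, and the three example blocks are all assumptions made for illustration. The sketch treats a blockset as a list of blocks in which every position of every sequence appears exactly once, projects it onto a reference by keeping the blocks that contain that reference and ordering them by reference position, and then checks that two different projections agree on which positions are aligned.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: a block maps a species name to
# (start position, gapped alignment row). All rows in a block have equal length.
Block = Dict[str, Tuple[int, str]]


def project(blocks: List[Block], reference: str) -> List[Block]:
    """Keep the blocks that contain `reference`, ordered by reference start."""
    with_ref = [b for b in blocks if reference in b]
    return sorted(with_ref, key=lambda b: b[reference][0])


def aligned_pairs(blocks: List[Block], ref: str, other: str) -> Set[Tuple[int, int]]:
    """Collect (ref_pos, other_pos) pairs that share an alignment column."""
    pairs: Set[Tuple[int, int]] = set()
    for block in project(blocks, ref):
        if other not in block:
            continue
        i, ref_row = block[ref]
        j, other_row = block[other]
        for a, b in zip(ref_row, other_row):
            if a != "-" and b != "-":
                pairs.add((i, j))
            # Advance each coordinate only past real bases, not gaps.
            if a != "-":
                i += 1
            if b != "-":
                j += 1
    return pairs


# Made-up toy blockset: each position of each sequence occurs in exactly one block.
blocks: List[Block] = [
    {"human": (4, "ACGT"), "fugu": (0, "AC-T")},
    {"human": (0, "ACGT"), "mouse": (0, "A-GT")},
    {"mouse": (3, "GG"), "fugu": (3, "GG")},
]

human_view = aligned_pairs(blocks, "human", "fugu")
fugu_view = aligned_pairs(blocks, "fugu", "human")
# Both projections are derived from the same blocks, so they must agree.
consistent = {(f, h) for h, f in human_view} == fugu_view
```

Because both views are read from one shared set of blocks rather than from two independently computed reference-based alignments, the agreement holds by construction in this sketch, which is the property the abstract attributes to projections of a threaded blockset.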

Footnotes

  • [Supplemental material, including the Methods section, is available online at www.genome.org. The multiple alignments produced by MULTIZ can be viewed at the Santa Cruz Genome Browser or downloaded in bulk. TBA, simulated test data, and the Gmaj visualization tool can be downloaded from http://bio.cse.psu.edu/.]

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1933104.

  • 6 Present address: School of Computer Science, McGill University, Montreal, Canada.

  • 7 Corresponding author. E-mail webb@bx.psu.edu; fax (814) 863-1357.

    • Received September 2, 2003.
    • Accepted February 3, 2004.