Gene prediction and verification in a compact genome with numerous small introns

  1. Aaron E. Tenney1,
  2. Randall H. Brown1,
  3. Charles Vaske1,4,
  4. Jennifer K. Lodge2,
  5. Tamara L. Doering3, and
  6. Michael R. Brent1,5
  1. Laboratory for Computational Genomics and Department of Computer Science, Washington University, St. Louis, Missouri 63130, USA
  2. Department of Biochemistry and Molecular Biology, Saint Louis University School of Medicine, St. Louis, Missouri 63104, USA
  3. Department of Molecular Microbiology, Washington University Medical School, St. Louis, Missouri 63110-1093, USA

Abstract

The genomes of clusters of related eukaryotes are now being sequenced at an increasing rate, creating a need for accurate, low-cost annotation of exon–intron structures. In this paper, we demonstrate that reverse transcription-polymerase chain reaction (RT–PCR) and direct sequencing based on predicted gene structures satisfy this need, at least for single-celled eukaryotes. The TWINSCAN gene prediction algorithm was adapted for the fungal pathogen Cryptococcus neoformans by using a precise model of intron lengths in combination with ungapped alignments between the genome sequences of the two closely related Cryptococcus varieties. This approach resulted in ∼60% of known genes being predicted exactly right at every coding base and splice site. When previously unannotated TWINSCAN predictions were tested by RT–PCR and direct sequencing, 75% of targets spanning two predicted introns were amplified and produced high-quality sequence. When targets spanning the complete predicted open reading frame were tested, 72% of them amplified and produced high-quality sequence. We conclude that sequencing a small number of expressed sequence tags (ESTs) to provide training data, running TWINSCAN on an entire genome, and then performing RT–PCR and direct sequencing on all of its predictions would be a cost-effective method for obtaining an experimentally verified genome annotation.

Footnotes

  • [All sequences, predictions, primers, traces, accession numbers, and links to software are available at http://genes.cse.wustl.edu/tenney-04-crypto-data/].

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2816704. Article published online before print in October 2004.

  • 4 Present address: Dept. of Biomolecular Engineering, Univ. of California–Santa Cruz, Santa Cruz, California 95064, USA.

  • 5 Corresponding author. E-mail brent{at}cse.wustl.edu; fax (314) 935-7302.

    • Accepted August 12, 2004.
    • Received April 21, 2004.