Computational Inference of Homologous Gene Structures in the Human Genome

  Ru-Fang Yeh1, Lee P. Lim1,2, and Christopher B. Burge1,3

  1 Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
  2 Center for Cancer Research, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA

Abstract

With the human genome sequence approaching completion, a major challenge is to identify the locations and encoded protein sequences of all human genes. To address this problem we have developed a new gene identification algorithm, GenomeScan, which combines exon–intron and splice signal models with similarity to known protein sequences in an integrated model. Extensive testing shows that GenomeScan can accurately identify the exon–intron structures of genes in finished or draft human genome sequence with a low rate of false-positives. Application of GenomeScan to 2.7 billion bases of human genomic DNA identified at least 20,000–25,000 human genes out of an estimated 30,000–40,000 present in the genome. The results show an accurate and efficient automated approach for identifying genes in higher eukaryotic genomes and provide a first-level annotation of the draft human genome.
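
The idea of scoring gene structures with sequence models and protein similarity together can be illustrated with a toy sketch. This is not the GenomeScan model: the `CandidateExon` fields, the additive combination, and the `similarity_weight` parameter are all hypothetical and chosen only to show how similarity evidence might shift the choice between competing exon candidates.

```python
from dataclasses import dataclass


@dataclass
class CandidateExon:
    start: int
    end: int
    # Hypothetical score from exon-intron and splice signal models.
    model_score: float
    # Hypothetical score for similarity to a known protein (0.0 if no hit).
    similarity_score: float


def combined_score(exon, similarity_weight=1.0):
    # Assumption: the two evidence sources are simply added; the published
    # algorithm integrates them in its own probabilistic model.
    return exon.model_score + similarity_weight * exon.similarity_score


def best_exon(candidates, similarity_weight=1.0):
    # Pick the candidate with the highest combined score.
    return max(candidates, key=lambda e: combined_score(e, similarity_weight))


candidates = [
    CandidateExon(100, 250, model_score=2.0, similarity_score=0.0),
    CandidateExon(120, 250, model_score=1.5, similarity_score=3.0),
]

chosen = best_exon(candidates)
print(chosen.start, chosen.end)
```

With the similarity term switched off (`similarity_weight=0.0`), the first candidate wins on its model score alone; with it enabled, the candidate supported by a protein match is preferred.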

Footnotes

  • 3 Corresponding author.

  • E-MAIL cburge{at}mit.edu; FAX 617-253-3128.

  • Article and publication are at www.genome.org/cgi/doi/10.1101/gr.175701.

    • Received December 14, 2000.
    • Accepted February 27, 2001.