Abstract
The approach to annotating a genome critically affects the number and accuracy of genes identified in the genome sequence. Genome annotation based on stringent gene identification is prone to underestimate the complement of genes encoded in a genome. In contrast, over-prediction of putative genes followed by exhaustive computational sequence, motif and structural homology search will find rarely expressed, possibly unique, new genes at the risk of including non-functional genes. We developed a two-stage approach that combines the merits of stringent genome annotation with the benefits of over-prediction. In stage 1, we identify plausible genes regardless of matches with EST, cDNA or protein sequences from the organism. In stage 2, proteins predicted from the plausible genes are compared at the protein level with EST, cDNA and protein sequences, and with protein structures, from other organisms. Remote but biologically meaningful protein sequence or structure homologies provide supporting evidence for genuine genes. The method, applied to the Drosophila melanogaster genome, validated 1,042 novel candidate genes after filtering 19,410 plausible genes, of which 12,124 matched the original 13,601 annotated genes (Adams et al.; ref. 1). This annotation strategy is applicable to genomes of all organisms, including human.
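The two-stage logic can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Candidate` fields, the score threshold, the E-value cutoff and the toy records are all hypothetical, chosen only to show over-prediction followed by homology-based filtering. The sketch also assumes that novel candidates are drawn from plausible genes that do not match the existing annotation, which the abstract implies but does not spell out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Candidate:
    name: str
    gene_score: float      # hypothetical stage-1 plausibility score
    in_annotation: bool    # True if it matches an already annotated gene
    # Hypothetical E-values of hits against sequences/structures from other organisms
    homology_evalues: List[float] = field(default_factory=list)


def stage1_plausible(cands, min_score=0.5):
    """Stage 1: keep plausible genes, ignoring same-organism EST/cDNA/protein matches."""
    return [c for c in cands if c.gene_score >= min_score]


def stage2_supported(cands, max_evalue=1e-5):
    """Stage 2: keep candidates with at least one sufficiently strong homology hit."""
    return [c for c in cands if any(e <= max_evalue for e in c.homology_evalues)]


def novel_candidates(cands):
    """Over-predict, drop genes already annotated, then require homology support."""
    plausible = stage1_plausible(cands)
    unannotated = [c for c in plausible if not c.in_annotation]
    return stage2_supported(unannotated)


# Toy input (invented for illustration only)
cands = [
    Candidate("known_gene", 0.9, True, [1e-30]),
    Candidate("novel_supported", 0.7, False, [1e-8, 0.2]),
    Candidate("novel_unsupported", 0.8, False, [0.5]),
    Candidate("implausible", 0.1, False, [1e-20]),
]
result = novel_candidates(cands)
print([c.name for c in result])

# Counts reported in the abstract
plausible_total = 19410
matched_annotation = 12124
unmatched = plausible_total - matched_annotation
print(unmatched)  # 7286 plausible genes without a match in the annotation
```

Under the stated assumption, the 1,042 validated candidates would be the subset of those 7,286 unmatched plausible genes that gained supporting evidence in stage 2.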
References
Adams, M.D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000).
Rubin, G.M. et al. A Drosophila complementary DNA resource. Science 287, 2222–2224 (2000).
Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
Burge, C.B. & Karlin, S. Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8, 346–354 (1998).
Reese, M.G. et al. Genome annotation assessment in Drosophila melanogaster. Genome Res. 10, 483–501 (2000).
Boguski, M.S., Tolstoshev, C.M. & Bassett, D.E. Gene discovery in dbEST. Science 265, 1993–1994 (1994).
Gaasterland, T. & Ragan, M.A. Constructing multigenome views of whole microbial genomes. Microb. Comp. Genomics 3, 177–192 (1998).
Benson, D.A. et al. GenBank. Nucleic Acids Res. 27, 12–17 (1999).
Bhat, T.N. et al. The PDB data uniformity project. Nucleic Acids Res. 29, 214–218 (2001).
Deckert, G. et al. The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 392, 353–358 (1998).
Gaasterland, T. et al. MAGPIE/EGRET annotation of the 2.9-Mb Drosophila melanogaster Adh region. Genome Res. 10, 502–510 (2000).
Sánchez, R. & Sali, A. Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. USA 95, 13597–13602 (1998).
Sánchez, R. & Sali, A. ModBase: a database of comparative protein structure models. Bioinformatics 15, 1060–1061 (1999).
Sánchez, R. & Sali, A. Evaluation of comparative protein structure modeling by MODELLER-3. Proteins Suppl. 1, 50–58 (1997).
Martí-Renom, M.A. et al. Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct. 29, 291–325 (2000).
Reese, M.G., Kulp, D., Tammana, H. & Haussler, D. Genie—gene finding in Drosophila melanogaster. Genome Res. 10, 529–538 (2000).
Strausberg, R.L., Feingold, E.A., Klausner, R.D. & Collins, F.S. The mammalian gene collection. Science 286, 455–457 (1999).
Reboul, J. et al. Open-reading-frame sequence tags (OSTs) support the existence of at least 17,300 genes in C. elegans. Nature Genet. 27, 332–336 (2001).
Burley, S.K. et al. Structural genomics: beyond the human genome project. Nature Genet. 23, 151–157 (1999).
Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Salamov, A.A. & Solovyev, V.V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522 (2000).
Henikoff, J., Henikoff, S. & Pietrokovski, S. New features of the Blocks Database servers. Nucleic Acids Res. 27, 226–228 (1999).
Hofmann, K., Bucher, P., Falquet, L. & Bairoch, A. The PROSITE database, its status in 1999. Nucleic Acids Res. 27, 215–219 (1999).
Altschul, S.F. & Koonin, E.V. Iterated profile searches with PSI-BLAST—a tool for discovery in protein databases. Trends Biochem. Sci. 23, 444–447 (1998).
Sali, A. & Blundell, T.L. Comparative protein modeling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779–815 (1993).
Bateman, A. et al. Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic Acids Res. 27, 260–262 (1999).
Acknowledgements
We thank S. Burley, M. Vidal, J. Sorge, J. Goncalves, M. Ashburner, S. Lewis, M. Young and U. Gaul for insights and comments. This work was partially supported by the Mathers, Sinsheimer and Mallinckrodt Foundations, National Cancer Institute Health grant R33CA84699, National Institutes of Health grant P50GM62529, and the National Science Foundation grant DBI-9984882.
About this article
Cite this article
Gopal, S., Schroeder, M., Pieper, U. et al. Homology-based annotation yields 1,042 new candidate genes in the Drosophila melanogaster genome. Nat Genet 27, 337–340 (2001). https://doi.org/10.1038/85922
DOI: https://doi.org/10.1038/85922