Homology-based annotation yields 1,042 new candidate genes in the Drosophila melanogaster genome

Gopal, Shuba; Schroeder, Mark; Pieper, Ursula; Sczyrba, Alexander; Aytekin-Kurban, Gulriz; Bekiranov, Stefan; Fajardo, J. Eduardo; Eswar, Narayanan; Sanchez, Roberto; Sali, Andrej; Gaasterland, Terry

doi:10.1038/85922

Letter
Published: March 2001

Shuba Gopal^1,na1, Mark Schroeder^1,na1, Ursula Pieper^1,2, Alexander Sczyrba^1, Gulriz Aytekin-Kurban^1, Stefan Bekiranov^1, J. Eduardo Fajardo^1, Narayanan Eswar^2, Roberto Sanchez^2, Andrej Sali^2 & Terry Gaasterland^1

Nature Genetics volume 27, pages 337–340 (2001)

Abstract

The approach to annotating a genome critically affects the number and accuracy of genes identified in the genome sequence. Genome annotation based on stringent gene identification is prone to underestimate the complement of genes encoded in a genome. In contrast, over-prediction of putative genes followed by exhaustive computational sequence, motif and structural homology search will find rarely expressed, possibly unique, new genes at the risk of including non-functional genes. We developed a two-stage approach that combines the merits of stringent genome annotation with the benefits of over-prediction. First we identify plausible genes regardless of matches with EST, cDNA or protein sequences from the organism (stage 1). In the second stage, proteins predicted from the plausible genes are compared at the protein level with EST, cDNA and protein sequences, and protein structures from other organisms (stage 2). Remote but biologically meaningful protein sequence or structure homologies provide supporting evidence for genuine genes. The method, applied to the Drosophila melanogaster genome, validated 1,042 novel candidate genes after filtering 19,410 plausible genes, of which 12,124 matched the original 13,601 annotated genes¹. This annotation strategy is applicable to genomes of all organisms, including human.
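As a rough illustration of the two-stage strategy summarized in the abstract, the following Python sketch separates over-predicted candidates into previously annotated genes, novel candidates with homology support, and unsupported predictions. The `Candidate` class, its field names, the evidence labels, and the rule that a single hit counts as support are all assumptions made for illustration; the abstract does not describe the actual tools or thresholds at this level of detail.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A plausible gene produced by stage 1 (over-prediction). Hypothetical structure."""
    gene_id: str
    overlaps_annotated: bool  # True if it matches an already annotated gene
    # Stage 2 evidence labels (assumed names), e.g. "EST", "cDNA", "protein", "structure"
    homology_hits: set = field(default_factory=set)


def stage2_filter(candidates):
    """Split stage-1 candidates into known, novel-supported, and unsupported lists."""
    known, novel, unsupported = [], [], []
    for c in candidates:
        if c.overlaps_annotated:
            known.append(c)        # already part of the existing annotation
        elif c.homology_hits:
            novel.append(c)        # new candidate backed by at least one homology hit
        else:
            unsupported.append(c)  # over-predicted, no supporting evidence
    return known, novel, unsupported


# Toy input: three made-up candidates, not data from the paper.
candidates = [
    Candidate("g1", True, {"EST"}),
    Candidate("g2", False, {"protein", "structure"}),
    Candidate("g3", False),
]
known, novel, unsupported = stage2_filter(candidates)
print([c.gene_id for c in novel])  # → ['g2']
```

In this sketch only `g2` is reported as a novel candidate, because it lies outside the existing annotation yet has homology support; `g3` is discarded as an unsupported over-prediction.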


References

1. Adams, M.D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000).
2. Rubin, G.M. et al. A Drosophila complementary DNA resource. Science 287, 2222–2224 (2000).
3. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
4. Burge, C.B. & Karlin, S. Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8, 346–354 (1998).
5. Reese, M.G. et al. Genome annotation assessment in Drosophila melanogaster. Genome Res. 10, 483–501 (2000).
6. Boguski, M.S., Tolstoshev, C.M. & Bassett, D.E. Gene discovery in dbEST. Science 265, 1993–1994 (1994).
7. Gaasterland, T. & Ragan, M.A. Constructing multigenome views of whole microbial genomes. Microb. Comp. Genomics 3, 177–192 (1998).
8. Benson, D.A. et al. GenBank. Nucleic Acids Res. 27, 12–17 (1999).
9. Bhat, T.N. et al. The PDB data uniformity project. Nucleic Acids Res. 29, 214–218 (2001).
10. Deckert, G. et al. The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 392, 353–358 (1998).
11. Gaasterland, T. et al. MAGPIE/EGRET annotation of the 2.9-Mb Drosophila melanogaster Adh region. Genome Res. 10, 502–510 (2000).
12. Sánchez, R. & Sali, A. Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. USA 95, 13597–13602 (1998).
13. Sánchez, R. & Sali, A. ModBase: a database of comparative protein structure models. Bioinformatics 15, 1060–1061 (1999).
14. Sánchez, R. & Sali, A. Evaluation of comparative protein structure modeling by MODELLER-3. Proteins Suppl. 1, 50–58 (1997).
15. Martí-Renom, M.A. et al. Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct. 29, 291–325 (2000).
16. Reese, M.G., Kulp, D., Tammana, H. & Haussler, D. Genie—gene finding in Drosophila melanogaster. Genome Res. 10, 529–538 (2000).
17. Strausberg, R.L., Feingold, E.A., Klausner, R.D. & Collins, F.S. The mammalian gene collection. Science 286, 455–457 (1999).
18. Reboul, J. et al. Open-reading-frame sequence tags (OSTs) support the existence of at least 17,300 genes in C. elegans. Nature Genet. 27, 332–336 (2001).
19. Burley, S.K. et al. Structural genomics: beyond the human genome project. Nature Genet. 23, 151–157 (1999).
20. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
21. Salamov, A.A. & Solovyev, V.V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522 (2000).
22. Henikoff, J., Henikoff, S. & Pietrokovski, S. New features of the Blocks Database servers. Nucleic Acids Res. 27, 226–228 (1999).
23. Hofmann, K., Bucher, P., Falquet, L. & Bairoch, A. The PROSITE database, its status in 1999. Nucleic Acids Res. 27, 215–219 (1999).
24. Altschul, S.F. & Koonin, E.V. Iterated profile searches with PSI-BLAST—a tool for discovery in protein databases. Trends Biochem. Sci. 23, 444–447 (1998).
25. Sali, A. & Blundell, T.L. Comparative protein modeling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779–815 (1993).
26. Bateman, A. et al. Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic Acids Res. 27, 260–262 (1999).

Acknowledgements

We thank S. Burley, M. Vidal, J. Sorge, J. Goncalves, M. Ashburner, S. Lewis, M. Young and U. Gaul for insights and comments. This work was partially supported by the Mathers, Sinsheimer and Mallinckrodt Foundations, National Cancer Institute Health grant R33CA84699, National Institutes of Health grant P50GM62529, and the National Science Foundation grant DBI-9984882.

Author information

Shuba Gopal and Mark Schroeder: These authors contributed equally to this work.

Authors and Affiliations

Laboratories of Computational Genomics, The Rockefeller University, New York, New York, USA
Shuba Gopal, Mark Schroeder, Ursula Pieper, Alexander Sczyrba, Gulriz Aytekin-Kurban, Stefan Bekiranov, J. Eduardo Fajardo & Terry Gaasterland
Biophysics, The Rockefeller University, New York, New York, USA
Ursula Pieper, Narayanan Eswar, Roberto Sanchez & Andrej Sali

Corresponding author

Correspondence to Terry Gaasterland.

About this article

Cite this article

Gopal, S., Schroeder, M., Pieper, U. et al. Homology-based annotation yields 1,042 new candidate genes in the Drosophila melanogaster genome. Nat Genet 27, 337–340 (2001). https://doi.org/10.1038/85922

Received: 22 December 2000
Accepted: 07 February 2001
Issue Date: March 2001
DOI: https://doi.org/10.1038/85922

This article is cited by

Characterization of genes coding for galacturonosyltransferase-like (GATL) proteins in rice
- Jinlong Liu
- Mansi Luo
- Shaobo Li
Genes & Genomics (2016)

Genome-wide identification, classification and expression analysis of GHMP genes family in Arabidopsis thaliana
- Wenjun Xiao
- Hongping Chang
- Xinhong Guo
Plant Systematics and Evolution (2015)

Comparative characterization, expression pattern and function analysis of the 12-oxo-phytodienoic acid reductase gene family in rice
- Wenyan Li
- Feng Zhou
- Jinfa Wang
Plant Cell Reports (2011)

A high-quality catalog of the Drosophila melanogaster proteome
- Erich Brunner
- Christian H Ahrens
- Ruedi Aebersold
Nature Biotechnology (2007)

GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes
- David MA Martin
- Matthew Berriman
- Geoffrey J Barton
BMC Bioinformatics (2004)
