  • Letter

Homology-based annotation yields 1,042 new candidate genes in the Drosophila melanogaster genome

Abstract

The approach to annotating a genome critically affects the number and accuracy of genes identified in the genome sequence. Genome annotation based on stringent gene identification is prone to underestimate the complement of genes encoded in a genome. In contrast, over-prediction of putative genes followed by exhaustive computational sequence, motif and structural homology search will find rarely expressed, possibly unique, new genes at the risk of including non-functional genes. We developed a two-stage approach that combines the merits of stringent genome annotation with the benefits of over-prediction. First we identify plausible genes regardless of matches with EST, cDNA or protein sequences from the organism (stage 1). In the second stage, proteins predicted from the plausible genes are compared at the protein level with EST, cDNA and protein sequences, and protein structures from other organisms (stage 2). Remote but biologically meaningful protein sequence or structure homologies provide supporting evidence for genuine genes. The method, applied to the Drosophila melanogaster genome, validated 1,042 novel candidate genes after filtering 19,410 plausible genes, of which 12,124 matched the original 13,601 annotated genes (ref. 1). This annotation strategy is applicable to genomes of all organisms, including human.
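The stage 2 filter described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names, the `stage2_filter` function, the toy records and the E-value cutoff of 1e-5 are all hypothetical and chosen only for illustration. Stage 1 (over-prediction of plausible genes) is assumed to have already run, and its output is represented here as a plain list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hit:
    """One homology hit for a predicted protein (illustrative fields only)."""
    source: str      # e.g. "EST", "cDNA", "protein", "structure"
    organism: str    # organism the matching sequence or structure comes from
    e_value: float   # significance of the match; smaller is stronger


@dataclass
class PlausibleGene:
    """A stage 1 prediction, made without requiring same-organism evidence."""
    gene_id: str
    in_original_annotation: bool
    hits: List[Hit] = field(default_factory=list)


def stage2_filter(genes, home_organism, max_e_value=1e-5):
    """Return IDs of plausible genes that are absent from the original
    annotation but supported by at least one hit from another organism.

    The cutoff max_e_value is an assumed placeholder, not a value from the paper.
    """
    validated = []
    for gene in genes:
        if gene.in_original_annotation:
            continue  # already annotated, so not a novel candidate
        supported = any(
            hit.organism != home_organism and hit.e_value <= max_e_value
            for hit in gene.hits
        )
        if supported:
            validated.append(gene.gene_id)
    return validated


# Toy stage 1 output (entirely made up).
toy_genes = [
    PlausibleGene("pg1", True, [Hit("protein", "Homo sapiens", 1e-30)]),
    PlausibleGene("pg2", False, [Hit("EST", "Mus musculus", 1e-12)]),
    PlausibleGene("pg3", False, [Hit("protein", "Caenorhabditis elegans", 0.5)]),
    PlausibleGene("pg4", False, []),
]

novel = stage2_filter(toy_genes, home_organism="Drosophila melanogaster")
print(novel)
```

In this toy run only `pg2` survives: `pg1` is already annotated, `pg3` has only a weak hit, and `pg4` has no supporting evidence at all. A real pipeline would also need motif and structure-based evidence, which this sketch folds into the single `Hit` record.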

References

  1. Adams, M.D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000).

  2. Rubin, G.M. et al. A Drosophila complementary DNA resource. Science 287, 2222–2224 (2000).

  3. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).

  4. Burge, C.B. & Karlin, S. Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8, 346–354 (1998).

  5. Reese, M.G. et al. Genome annotation assessment in Drosophila melanogaster. Genome Res. 10, 483–501 (2000).

  6. Boguski, M.S., Tolstoshev, C.M. & Bassett, D.E. Gene discovery in dbEST. Science 265, 1993–1994 (1994).

  7. Gaasterland, T. & Ragan, M.A. Constructing multigenome views of whole microbial genomes. Microb. Comp. Genomics 3, 177–192 (1998).

  8. Benson, D.A. et al. GenBank. Nucleic Acids Res. 27, 12–17 (1999).

  9. Bhat, T.N. et al. The PDB data uniformity project. Nucleic Acids Res. 29, 214–218 (2001).

  10. Deckert, G. et al. The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 392, 353–358 (1998).

  11. Gaasterland, T. et al. MAGPIE/EGRET annotation of the 2.9-Mb Drosophila melanogaster Adh region. Genome Res. 10, 502–510 (2000).

  12. Sánchez, R. & Sali, A. Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. USA 95, 13597–13602 (1998).

  13. Sánchez, R. & Sali, A. ModBase: a database of comparative protein structure models. Bioinformatics 15, 1060–1061 (1999).

  14. Sánchez, R. & Sali, A. Evaluation of comparative protein structure modeling by MODELLER-3. Proteins Suppl. 1, 50–58 (1997).

  15. Martí-Renom, M.A. et al. Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct. 29, 291–325 (2000).

  16. Reese, M.G., Kulp, D., Tammana, H. & Haussler, D. Genie—gene finding in Drosophila melanogaster. Genome Res. 10, 529–538 (2000).

  17. Strausberg, R.L., Feingold, E.A., Klausner, R.D. & Collins, F.S. The mammalian gene collection. Science 286, 455–457 (1999).

  18. Reboul, J. et al. Open-reading-frame sequence tags (OSTs) support the existence of at least 17,300 genes in C. elegans. Nature Genet. 27, 332–336 (2001).

  19. Burley, S.K. et al. Structural genomics: beyond the human genome project. Nature Genet. 23, 151–157 (1999).

  20. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).

  21. Salamov, A.A. & Solovyev, V.V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522 (2000).

  22. Henikoff, J., Henikoff, S. & Pietrokovski, S. New features of the Blocks Database servers. Nucleic Acids Res. 27, 226–228 (1999).

  23. Hofmann, K., Bucher, P., Falquet, L. & Bairoch, A. The PROSITE database, its status in 1999. Nucleic Acids Res. 27, 215–219 (1999).

  24. Altschul, S.F. & Koonin, E.V. Iterated profile searches with PSI-BLAST—a tool for discovery in protein databases. Trends Biochem. Sci. 23, 444–447 (1998).

  25. Sali, A. & Blundell, T.L. Comparative protein modeling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779–815 (1993).

  26. Bateman, A. et al. Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic Acids Res. 27, 260–262 (1999).

Acknowledgements

We thank S. Burley, M. Vidal, J. Sorge, J. Goncalves, M. Ashburner, S. Lewis, M. Young and U. Gaul for insights and comments. This work was partially supported by the Mathers, Sinsheimer and Mallinckrodt Foundations, National Cancer Institute Health grant R33CA84699, National Institutes of Health grant P50GM62529, and the National Science Foundation grant DBI-9984882.

Author information

Corresponding author

Correspondence to Terry Gaasterland.

About this article

Cite this article

Gopal, S., Schroeder, M., Pieper, U. et al. Homology-based annotation yields 1,042 new candidate genes in the Drosophila melanogaster genome. Nat Genet 27, 337–340 (2001). https://doi.org/10.1038/85922

  • DOI: https://doi.org/10.1038/85922
