Toward a Catalog of Human Genes and Proteins: Sequencing and Analysis of 500 Novel Complete Protein Coding Human cDNAs

  1. Stefan Wiemann1,12,
  2. Bernd Weil1,2,
  3. Ruth Wellenreuther1,
  4. Johannes Gassenhuber1,2,
  5. Sabine Glassl3,
  6. Wilhelm Ansorge3,
  7. Michael Böcher4,
  8. Helmut Blöcker4,
  9. Stefan Bauersachs5,
  10. Helmut Blum5,
  11. Jürgen Lauber6,
  12. Andreas Düsterhöft6,
  13. Andreas Beyer7,
  14. Karl Köhrer7,
  15. Normann Strack2,
  16. Hans-Werner Mewes2,
  17. Birgit Ottenwälder8,
  18. Brigitte Obermaier8,
  19. Jens Tampe9,
  20. Dagmar Heubner10,
  21. Rolf Wambutt10,
  22. Bernhard Korn1,11,
  23. Michaela Klein1, and
  24. Annemarie Poustka1
  1Molecular Genome Analysis, German Cancer Research Center, 69120 Heidelberg, Germany; 2MIPS, GSF, 82152 Martinsried, Germany; 3Biochemical Instrumentation, European Molecular Biology Laboratory, 69117 Heidelberg, Germany; 4GBF–Genome Analysis, 38124 Braunschweig, Germany; 5Genzentrum der LMU München, 81377 München, Germany; 6QIAGEN GmbH, 40724 Hilden, Germany; 7Biologisch-Medizinisches Forschungszentrum, Heinrich-Heine-Universität Düsseldorf, 40225 Düsseldorf, Germany; 8MediGenomix GmbH, 82152 Martinsried, Germany; 9Fraunhofer Gesellschaft, 80636 München, Germany; 10AGOWA GmbH, 12489 Berlin, Germany; 11Resource Center of the German Genome Project, 69120 Heidelberg, Germany

Abstract

With the complete human genomic sequence being unraveled, the focus will shift to gene identification and to the functional analysis of gene products. The generation of a set of cDNAs, both sequences and physical clones, which contains the complete and noninterrupted protein coding regions of all human genes will provide the indispensable tools for the systematic and comprehensive analysis of protein function to eventually understand the molecular basis of man. Here we report the sequencing and analysis of 500 novel human cDNAs containing the complete protein coding frame. Assignment to functional categories was possible for 52% (259) of the encoded proteins, the remaining fraction having no similarities with known proteins. By aligning the cDNA sequences with the sequences of the finished chromosomes 21 and 22, we identified a number of genes that either had been completely missed in the analysis of the genomic sequences or had been wrongly predicted. Three of these genes appear to be present in several copies. We conclude that full-length cDNA sequencing also remains crucial for the accurate identification of genes. The set of 500 novel cDNAs, together with another 1000 full-coding cDNAs of known transcripts we have identified, adds up to cDNA representations covering 2%–5% of all human genes. We thus substantially contribute to the generation of a gene catalog, consisting of both full-coding cDNA sequences and clones, which should be made freely available and will become an invaluable tool for detailed functional studies.

[The sequence data described in this paper have been submitted to the EMBL database under the accession nos. given in Table 2.]

Footnotes

  • 12 Corresponding author.

  • E-MAIL s.wiemann{at}dkfz.de; FAX 49-6221-4252-4702.

  • Article published on-line before print: Genome Res., 10.1101/gr.154701.

  • Article and publication are at www.genome.org/cgi/doi/10.1101/gr.154701.

    • Received July 6, 2000.
    • Accepted December 29, 2000.