Comparative genomics as a tool for gene discovery

https://doi.org/10.1016/j.copbio.2006.01.007

With the increasing availability of data from multiple eukaryotic genome sequencing projects, attention has focused on interspecific comparisons to discover novel genes and transcribed genomic sequences. Generally, these extrinsic strategies combine ab initio gene prediction with expression and/or homology data to identify conserved gene candidates between two or more genomes. Interspecific sequence analyses have proven invaluable for the improvement of existing annotations, automation of annotation, and identification of novel coding regions and splice variants. Further, comparative genomic approaches hold the promise of improved prediction of terminal or small exons, microRNA precursors, and small peptide-encoding open reading frames — sequence elements that are difficult to identify through purely intrinsic methodologies in the absence of experimental data.

Introduction

The publication of the genome sequence for the yeast Saccharomyces cerevisiae in 1996 [1] ushered in the genomic era for the eukaryotic research community. Subsequently, the genome sequences of Caenorhabditis elegans [2], Drosophila melanogaster [3], Arabidopsis thaliana [4], and human [5] were published. These prototype genome projects provided biologists with the power to evaluate experimental observations in a whole-genome context. Technical innovations in molecular biology, biochemistry, and information processing precipitated by these projects have made high-throughput tools cost-effective and accessible to a wider range of investigators.

The need to improve annotation and gene identification in the prototype genome sequences, the desire to investigate natural variation and genome evolution, and the recognition of the practical limitations to gene discovery through the study of individual genomes have placed an emphasis on comparative studies. Accordingly, the number of completed eukaryotic genome sequencing projects (∼80) and ongoing projects (∼500) has expanded over the past five years (Genomes OnLine Database v2.0; http://www.genomesonline.org). The sequencing of additional genomes poses novel logistical and technical issues for the processing and interpretation of sequence data. Unlike the prototype eukaryotic genome projects, extensive manual curation of next-generation sequencing projects is neither time- nor resource-effective. Beginning with the annotation of the mouse genome [6], the partial or complete automation of genome sequence curation has become the norm.

This review will explore recent advances in the prediction and refinement of gene models, the empirical validation of these models, and the identification of non-coding transcribed sequences using comparative genomic approaches. While drawing on the literature at large, the utility of the approaches will be evaluated relative to the current state of plant genomics. Table 1 summarizes the methodologies presented in this review that have been formalized as discrete programs or computational pipelines and are available to the research community.

Generalities of gene discovery

De novo gene prediction frameworks are classified as either intrinsic or extrinsic. Intrinsic methodologies (Figure 1; path ‘A’) make gene predictions from only the information present in the individual DNA sequence analyzed. These methodologies are commonly encountered as ab initio tools [7, 8, 9, 10] and are, by definition, not comparative. Ab initio gene prediction algorithms display high sensitivity but low specificity (see Glossary) in their output models; both of these parameters are
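The sensitivity and specificity mentioned above can be illustrated with a minimal sketch. It assumes the convention often used in gene-finding evaluation, where sensitivity is the fraction of annotated exons that are predicted and specificity is the fraction of predicted exons that are annotated; the definition intended by this review may differ. The function name `exon_accuracy`, the exact-coordinate matching rule, and the coordinates are hypothetical and chosen only for illustration.

```python
def exon_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) at the exon level.

    Assumption: an exon counts as correct only if its (start, end)
    coordinates match an annotated exon exactly. Specificity here is
    TP / (TP + FP), i.e. the fraction of predictions that are correct.
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical example: every annotated exon is found, but half of the
# predictions have no annotated counterpart.
annotated_exons = [(100, 200), (300, 400), (500, 600)]
predicted_exons = [(100, 200), (300, 400), (500, 600),
                   (700, 800), (900, 950), (1000, 1100)]

sn, sp = exon_accuracy(predicted_exons, annotated_exons)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

In this made-up case the predictor recovers all annotated exons (high sensitivity) while over-predicting (low specificity), which is the pattern the paragraph attributes to ab initio tools.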

The application of expression data to gene discovery

Evidence-based gene discovery frameworks integrate empirical transcription and protein expression data with genome sequence to produce gene models (Figure 1; path ‘C’) and facilitate annotation [16]. Such data provide high specificity to gene model prediction, but sensitivity is contingent on the extent of the expression dataset(s). This property negatively impacts the identification of sequences with tightly regulated or low-abundance transcripts or of RNA species that are not translated. The

Sequence similarity applied to gene discovery

Similarity-based methods for gene discovery assume that the evolution of functional sequences is constrained by selection and spurious sequences are free to evolve neutrally. Thus, sequences that are conserved in interspecific comparisons are more likely to be biologically meaningful. Two recent studies [29, 30•] have attempted to determine the minimum number of genome sequences that are required for the identification of conserved regions. Modeling suggests that the number of required

Conclusions

Comparative approaches are proving their value for gene discovery and annotation improvement. Current results indicate that the availability of additional genome sequences and application of combinatorial approaches will further improve efficacy. Although recent studies have used transcription or protein expression data to support novel gene models, little attention has been focused on the functions of the coding regions identified; only one paper reviewed here used mutational and

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:

  • • of special interest

  • •• of outstanding interest

Glossary

cDNA
DNA molecule with the complementary sequence to a transcribed RNA.
cRNA
RNA molecule with the complementary sequence to a transcribed RNA.
Expressed sequence tag (EST)
Incomplete sequence from a transcribed RNA.
Ka/Ks
In interspecific sequence comparisons, a population genetics parameter used to infer neutral evolution versus selection in coding sequences on a per-site basis. Ka/Ks is the ratio of the number of nonsynonymous substitutions per nonsynonymous site (Ka) to the number of synonymous substitutions per synonymous site (Ks).
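
As a rough illustration of the Ka/Ks definition, the sketch below computes the ratio from counts of differences and sites that are assumed to have been tallied already from a codon alignment. The Jukes-Cantor correction for multiple substitutions is one common choice and is an assumption here, not a method stated in this review. The function names, the `tolerance` threshold, and the input counts are hypothetical.

```python
import math


def jukes_cantor(p):
    """Convert a proportion of differing sites into an estimated
    number of substitutions per site (Jukes-Cantor correction)."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion of differences must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def ka_ks(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (Ka, Ks, Ka/Ks) from pre-computed difference and site counts."""
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ks = jukes_cantor(syn_diffs / syn_sites)
    if ks == 0:
        raise ValueError("Ks is zero; the ratio is undefined")
    return ka, ks, ka / ks


def interpret(ratio, tolerance=0.1):
    """Label a ratio; the tolerance around 1 is an arbitrary illustrative choice."""
    if ratio < 1 - tolerance:
        return "purifying selection"
    if ratio > 1 + tolerance:
        return "positive selection"
    return "approximately neutral"


# Hypothetical counts for one pair of aligned coding sequences.
ka, ks, ratio = ka_ks(nonsyn_diffs=10, nonsyn_sites=700,
                      syn_diffs=30, syn_sites=300)
print(f"Ka={ka:.4f} Ks={ks:.4f} Ka/Ks={ratio:.3f} ({interpret(ratio)})")
```

A ratio well below 1 is usually read as evidence of purifying selection, which is the signal similarity-based methods rely on when treating conserved sequences as likely functional. A real analysis would also need a statistical test rather than a fixed threshold.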

References (52)

  • Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature (2002)
  • W.H. Majoros et al. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics (2004)
  • I. Korf. Gene finding in novel genomes. BMC Bioinformatics (2004)
  • C. Wei et al. Closing in on the C. elegans ORFeome by cloning TWINSCAN predictions. Genome Res (2005)
  • R. Guigo et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci USA (2003)
  • Z. Wang et al. A brief review of computational gene prediction methods. Genomics Proteomics Bioinformatics (2004)
  • H. Yao et al. Evaluation of five ab initio gene prediction programs for the discovery of maize genes. Plant Mol Biol (2005)
  • T. Hubbard et al. The Ensembl genome database project. Nucleic Acids Res (2002)
  • V.E. Velculescu et al. Serial analysis of gene expression. Science (1995)
  • L. Milanesi et al. ESTMAP: a system for expressed sequence tags mapping on genomic sequences. IEEE Trans Nanobioscience (2003)
  • L. Ding et al. EAnnot: a genome annotation tool using experimental evidence. Genome Res (2004)
  • V. Brendel et al. Gene structure prediction from consensus spliced alignment of multiple ESTs matching the same genomic locus. Bioinformatics (2004)
  • I. Korf et al. Integrating genomic homology into gene structure prediction. Bioinformatics (2001)
  • M. Alexandersson et al. SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res (2003)
  • G. Parra et al. Comparative gene prediction in human and mouse. Genome Res (2003)
  • J.E. Moore et al. Gene structure prediction in syntenic DNA segments. Nucleic Acids Res (2003)