Review
Special Issue: Systems Biology
Towards revealing the functions of all genes in plants

https://doi.org/10.1016/j.tplants.2013.10.006

Highlights

  • Limited plant genome annotation is a bottleneck in plant biology.

  • Omics data can help infer gene function using computational methods.

  • Co-function network-based function inference is underutilized in plant biology.

The great recent progress made in identifying the molecular parts lists of organisms revealed the paucity of our understanding of what most of the parts do. In this review, we introduce computational and statistical approaches and omics data used for inferring gene function in plants, with an emphasis on network-based inference. We also discuss caveats associated with network-based function predictions such as performance assessment, annotation propagation, the guilt-by-association concept, and the meaning of hubs. Finally, we note the current limitations and possible future directions such as the need for gold standard data from several species, unified access to data and tools, quantitative comparison of data and tool quality, and high-throughput experimental validation platforms for systematic gene function elucidation in plants.

Section snippets

How little we know

The elucidation of the genome sequence of many organisms, one of the outstanding achievements of our generation, confirmed what most biologists already suspected – that we know little about what most genes do. For example, approximately 40% of Arabidopsis (Arabidopsis thaliana, thale cress) and 1% of rice (Oryza sativa) protein-coding genes have had some aspect of their functions annotated based on experimental evidence (Figure 1) [1,2]. Moreover, we know about the biochemical activity,

What's in a function?

Gene function can mean different things to different people. Therefore, it is important to use controlled vocabularies for defining the function explicitly [3]. It is also helpful to use the same vocabularies for describing functions to maximize comparability across species. The Open Biological Ontologies consortium provides a set of guidelines for creating and improving ontologies and a forum for sharing them [4]. The Gene Ontology (GO) vocabulary system exemplifies the minimal information

What's in a network?

Just as a function can have different meanings, a network can also have different meanings and purposes in biology. Molecular networks that have been generated can be grouped into three categories: associational, informational, and mechanistic. Associational networks are akin to social networks such as Facebook or LinkedIn. We can guess things about a gene (or person) based on other genes (or people) it is connected to. For example, properties of genes can be identified from omics data and used

Omics data used in inferring gene function

Omics data can help elucidate functions of gene products either by direct measurement or by use in inference programs. Typically, a particular type of omics data is useful for elucidating functions in a particular GO domain. For example, sequencing peptides from isolated subcellular compartments is useful for assigning gene products to cellular components [15], but is less valuable for assigning them to biological processes. In addition, similarity between protein sequences enables molecular

Process of systematic gene function elucidation

Most omics data can be used to build co-function networks that are useful for inferring biological processes. Inferring biological processes from co-function networks can be broken down into seven steps, as shown in Figure 2. Typically, function inference uses the guilt-by-association concept, which tries to find similarity between characterized and uncharacterized genes based on some shared feature, and transfers the annotation from the knowledge donor (gene of known function)
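The guilt-by-association idea described above can be illustrated with a minimal sketch. It is not the procedure of Figure 2; it assumes a toy network stored as a dictionary of neighbors, and the gene names, annotation terms, function name `predict_by_neighbors`, and `min_votes` threshold are all hypothetical.

```python
from collections import Counter

def predict_by_neighbors(network, annotations, gene, min_votes=2):
    """Return the most common annotation among a gene's annotated neighbors.

    Returns None when no neighbor is annotated or when the top term
    is supported by fewer than min_votes neighbors.
    """
    votes = Counter()
    for neighbor in network.get(gene, ()):
        for term in annotations.get(neighbor, ()):
            votes[term] += 1
    if not votes:
        return None
    term, count = votes.most_common(1)[0]
    return term if count >= min_votes else None

# Hypothetical toy data: gene -> set of linked genes
network = {
    "geneX": {"geneA", "geneB", "geneC"},
    "geneA": {"geneX"},
    "geneB": {"geneX"},
    "geneC": {"geneX"},
}
# Knowledge donors: genes of known function
annotations = {
    "geneA": {"cellulose biosynthesis"},
    "geneB": {"cellulose biosynthesis"},
    "geneC": {"lipid metabolism"},
}

prediction = predict_by_neighbors(network, annotations, "geneX")
```

Here the uncharacterized `geneX` inherits the term shared by two of its three neighbors. A vote threshold is only one possible way to limit unreliable transfers.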

Integrating co-function networks

There are two advantages to integrating different types of omics data to construct co-function networks [57]. First, one type of data often reveals only one domain of gene function. Therefore, combining data types can increase prediction coverage. Second, a predicted functional association between two genes is more likely to be true if it is supported by multiple, independent data sources. Various data types have been integrated and used for biological process prediction in Arabidopsis [12,54].
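Both advantages can be seen in a small sketch that merges edge lists from several data sources. It assumes each source provides a per-edge score that can simply be added (reasonable only if the sources are independent and the scores are on a comparable scale, for example log-likelihood scores); the source names, scores, and the `integrate` function are hypothetical and do not represent a specific published method.

```python
from collections import defaultdict

def integrate(sources):
    """Merge per-source edge scores into one network.

    sources: dict mapping source name -> {(gene1, gene2): score}.
    Returns {edge: (total_score, sorted list of supporting sources)}.
    """
    totals = defaultdict(float)
    support = defaultdict(list)
    for name, edges in sources.items():
        for (g1, g2), score in edges.items():
            edge = tuple(sorted((g1, g2)))  # treat edges as undirected
            totals[edge] += score           # assumes independent evidence
            support[edge].append(name)
    return {edge: (totals[edge], sorted(support[edge])) for edge in totals}

# Hypothetical scores from two data types
sources = {
    "coexpression": {("geneA", "geneB"): 1.5, ("geneB", "geneC"): 0.5},
    "protein_interaction": {("geneB", "geneA"): 2.0},
}

integrated = integrate(sources)
```

The merged network covers every edge seen in any source (coverage), while the edge between `geneA` and `geneB` accumulates a higher score because two sources support it.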

Predicting and validating gene function

The biological processes of genes can be inferred from co-function networks either by using the enriched (statistically overrepresented) functions of network neighbors or by clustering genes and identifying the enriched or majority functions of the genes within a cluster [62,67] (Figure 2E). Clustering techniques largely fall into three categories: hierarchical, partitioning, and density-based (reviewed in [68]). Popular algorithms include the Markov Cluster Algorithm (MCL) for its efficiency and
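Enrichment of a function among network neighbors is commonly assessed with a hypergeometric test; the sketch below uses that test as one possible choice, which the text above does not prescribe. The gene identifiers, the term `"T"`, the genome size, and the function names are hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def neighbor_enrichment(neighbors, annotations, term, genome_size):
    """P-value for over-representation of `term` among a gene's neighbors."""
    annotated = {g for g, terms in annotations.items() if term in terms}
    k = len(annotated & set(neighbors))
    return hypergeom_pvalue(k, len(neighbors), len(annotated), genome_size)

# Hypothetical toy genome of 20 genes, 5 of which carry term "T"
annotations = {f"g{i}": {"T"} for i in range(1, 6)}
neighbors = ["g1", "g2", "g3", "g9"]  # 3 of 4 neighbors carry "T"

p_value = neighbor_enrichment(neighbors, annotations, "T", genome_size=20)
```

When many terms and genes are tested, the resulting p-values would normally need correction for multiple testing before any function is assigned.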

Caveats for gene function prediction

There are some caveats involved in function inference, which we will briefly discuss to help scientists identify which predictions are likely to be less reliable, and indicate the areas that are likely to be the focus of future research.

Where do we go from here?

Network-based function inference, despite having some caveats, holds great promise for accelerating gene function discovery in plants. However, there are some bottlenecks that we must overcome to achieve the goal of understanding the function of all genes in a plant genome.

Concluding remarks

Although network-based gene function prediction has been an active area of research for the past 15 years, its use in plant science has been limited. To exploit this underused technology we need: (i) more data; (ii) better assessment of data and tool quality; (iii) easy access to the data and tools; and (iv) high-throughput experimental validation. Many discoveries in plant science were made without knowing what most of the genes do. It is exciting to ponder what we will discover, those

Acknowledgments

Work in S.Y.R.’s laboratory is funded by the US National Science Foundation (grants IOS-1026003, DBI-0640769, and MCB-1052348) and the US Department of Energy (grant BER-65472). M.M. is funded by the Max Planck Institute for Molecular Plant Physiology. We thank Tanya Berardini from TAIR, Rachael Huntley from UniProt, and Marcela Monaco from Gramene for their help with accessing GO annotations, and Lee Chae and Eva Huala for their comments on the manuscript.

Glossary

Bayesian network
a graphical representation of the conditional dependencies of nodes.
Cluster compactness
a measure for determining the degree of similarity of nodes in a cluster.
Cluster completeness
a measure of how many nodes with the same property are assigned to the same cluster.
Cluster connectedness
a measure of the density of the links between nodes in a cluster.
Cluster purity or homogeneity
a measure of the homogeneity of the characteristics of the nodes in a cluster.
Cluster stability
a measure

References (101)

  • M. Ashburner

    Gene ontology: tool for the unification of biology. The Gene Ontology Consortium

    Nat. Genet.

    (2000)
  • G.P. Moss

    Enzyme Nomenclature

    (2013)
  • M.H. Saier

    The Transporter Classification Database: recent advances

    Nucleic Acids Res.

    (2009)
  • A. Pujar

    Whole-plant growth stage ontology for angiosperms and its application in plant biology

    Plant Physiol.

    (2006)
  • K. Ilic

    The plant structure ontology, a unified vocabulary of anatomy and morphology of a flowering plant

    Plant Physiol.

    (2007)
  • O. Thimm

    MAPMAN: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes

    Plant J.

    (2004)
  • S.Y. Rhee

    Use and misuse of the gene ontology annotations

    Nat. Rev. Genet.

    (2008)
  • I. Lee

    Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana

    Nat. Biotechnol.

    (2010)
  • G.W. Bassel

    Systems analysis of plant functional, transcriptional, physical interaction, and metabolic networks

    Plant Cell

    (2012)
  • M.E. Csete et al.

    Reverse engineering of biological complexity

    Science

    (2002)
  • A. Vazquez

    Global protein function prediction from protein–protein interaction networks

    Nat. Biotechnol.

    (2003)
  • I. Pagani

    The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata

    Nucleic Acids Res.

    (2012)
  • M.C. Schatz

    Current challenges in de novo plant genome sequencing and assembly

    Genome Biol.

    (2012)
  • S.F. Altschul

    Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

    Nucleic Acids Res.

    (1997)
  • E. Quevillon

    InterProScan: protein domains identifier

    Nucleic Acids Res.

    (2005)
  • M.Y. Galperin

    Analogous enzymes: independent inventions in enzyme evolution

    Genome Res.

    (1998)
  • Y. Nakamura

    Rate and polarity of gene fusion and fission in Oryza sativa and Arabidopsis thaliana

    Mol. Biol. Evol.

    (2007)
  • J.M. Lee et al.

    Genomic gene clustering analysis of pathways in eukaryotes

    Genome Res.

    (2003)
  • H.Y. Chu

    From hormones to secondary metabolism: the emergence of metabolic gene clusters in plants

    Plant J.

    (2011)
  • M. Pellegrini

    Assigning protein functions by comparative genome analysis: protein phylogenetic profiles

    Proc. Natl. Acad. Sci. U.S.A.

    (1999)
  • S. Gerdes

    Synergistic use of plant–prokaryote comparative genomics for functional annotations

    BMC Genomics

    (2011)
  • L.F. Wu

    Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters

    Nat. Genet.

    (2002)
  • M. Ryngajllo

    SLocX: predicting subcellular localization of Arabidopsis proteins leveraging gene expression data

    Front. Plant Sci.

    (2011)
  • S. Persson

    Identification of genes required for cellulose synthesis by regression analysis of public microarray data sets

    Proc. Natl. Acad. Sci. U.S.A.

    (2005)
  • X. Han

    Co-expression analysis identifies CRC and AP1 the regulator of Arabidopsis fatty acid biosynthesis

    J. Integr. Plant Biol.

    (2012)
  • H. Maeda

    Prephenate aminotransferase directs plant phenylalanine biosynthesis via arogenate

    Nat. Chem. Biol.

    (2011)
  • S.P. Ficklin et al.

    Gene coexpression network alignment and conservation of gene modules between two grass species: maize and rice

    Plant Physiol.

    (2011)
  • S. Movahedi

    Comparative network analysis reveals that tissue specificity and gene function are important factors influencing the mode of expression evolution in Arabidopsis and rice

    Plant Physiol.

    (2011)
  • R.V. Patel

    BAR expressolog identification: expression profile similarity ranking of homologous genes in plant species

    Plant J.

    (2012)
  • F.M. Giorgi

    Comparative study of RNA-seq- and microarray-derived coexpression networks in Arabidopsis thaliana

    Bioinformatics

    (2013)
  • Arabidopsis Interactome Mapping Consortium

    Evidence for network evolution in an Arabidopsis interactome map

    Science

    (2011)
  • A.C. Gavin

    Proteome survey reveals modularity of the yeast cell machinery

    Nature

    (2006)
  • M. Meier

    Proteome-wide protein interaction measurements of bacterial proteins of unknown function

    Proc. Natl. Acad. Sci. U.S.A.

    (2013)
  • H. Hishigaki

    Assessment of prediction accuracy of protein function from protein–protein interaction data

    Yeast

    (2001)
  • C. Stark

    The BioGRID Interaction Database: 2011 update

    Nucleic Acids Res.

    (2011)
  • H. Huang

    Where have all the interactions gone? Estimating the coverage of two-hybrid protein interaction maps

    PLoS Comput. Biol.

    (2007)
  • G.T. Hart

    How complete are current yeast and human protein-interaction networks?

    Genome Biol.

    (2006)
  • H. Yu

    High-quality binary protein interaction map of the yeast interactome network

    Science

    (2008)
  • S. Bandyopadhyay

    Rewiring of genetic networks in response to DNA damage

    Science

    (2010)
  • M. Babu

    Genetic interaction maps in Escherichia coli reveal functional crosstalk among cell envelope biogenesis pathways

    PLoS Genet.

    (2011)