Trends in Plant Science
Review
Special Issue: Systems Biology
Towards revealing the functions of all genes in plants
Section snippets
How little we know
The elucidation of the genome sequences of many organisms, one of the outstanding achievements of our generation, confirmed what most biologists already suspected – that we know little about what most genes do. For example, approximately 40% of Arabidopsis (Arabidopsis thaliana, thale cress) and 1% of rice (Oryza sativa) protein-coding genes have had some aspect of their functions annotated based on experimental evidence (Figure 1) [1, 2]. Moreover, we know about the biochemical activity,
What's in a function?
Gene function can mean different things to different people. Therefore, it is important to use controlled vocabularies for defining the function explicitly [3]. It is also helpful to use the same vocabularies for describing functions to maximize comparability across species. The Open Biological Ontologies consortium provides a set of guidelines for creating and improving ontologies and a forum for sharing them [4]. The Gene Ontology (GO) vocabulary system exemplifies the minimal information
What's in a network?
Just as a function can have different meanings, a network can also have different meanings and purposes in biology. Molecular networks that have been generated can be grouped into three categories: associational, informational, and mechanistic. Associational networks are akin to social networks such as Facebook or LinkedIn. We can guess things about a gene (or person) based on other genes (or people) it is connected to. For example, properties of genes can be identified from omics data and used
Omics data used in inferring gene function
Omics data can help elucidate the functions of gene products either by direct measurement or by use in inference programs. Typically, a particular type of omics data is useful for elucidating functions in a particular GO domain. For example, sequencing peptides from isolated subcellular compartments is useful for assigning gene products to cellular components [15], but is less valuable for assigning them to biological processes. In addition, similarity between protein sequences enables molecular
Process of systematic gene function elucidation
Most omics data can be used to build co-function networks that are useful for inferring biological processes. Inferring biological processes from co-function networks can be broken down into seven steps, as shown in Figure 2. Typically, function inference uses the guilt-by-association concept, which tries to find similarity between characterized and uncharacterized genes based on some shared feature, and transfers the annotation from the knowledge donor (gene of known function)
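The guilt-by-association idea can be illustrated with a minimal sketch. The version below assumes a simple majority vote among annotated network neighbors; the gene names, annotations, and the function name `predict_by_neighbors` are hypothetical and are not taken from the review or from any published tool.

```python
from collections import Counter


def predict_by_neighbors(network, annotations, gene):
    """Transfer the most common annotation among a gene's annotated neighbors.

    network: dict mapping a gene to the set of genes it is linked to.
    annotations: dict mapping characterized genes to a set of function terms.
    Returns (term, supporting_neighbor_count), or None if no neighbor is annotated.
    """
    votes = Counter()
    for neighbor in network.get(gene, ()):
        for term in annotations.get(neighbor, ()):
            votes[term] += 1
    if not votes:
        return None
    # Highest vote count wins; ties are broken alphabetically for determinism.
    return min(votes.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy data: geneX is uncharacterized, its neighbors are knowledge donors.
network = {"geneX": {"geneA", "geneB", "geneC"}}
annotations = {
    "geneA": {"cellulose biosynthesis"},
    "geneB": {"cellulose biosynthesis", "cell wall organization"},
    "geneC": {"photosynthesis"},
}
prediction = predict_by_neighbors(network, annotations, "geneX")
```

A majority vote is only one possible transfer rule; it ignores how common a term is in the genome as a whole, which is why enrichment statistics are often preferred.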
Integrating co-function networks
There are two advantages in integrating different types of omics data to construct co-function networks [57]. First, one type of data often reveals only one domain of gene function. Therefore, combining data types can increase prediction coverage. Second, a predicted functional association between two genes is more likely to be true if it is supported by multiple, independent data sources. Various data types have been integrated and used for biological process prediction in Arabidopsis [12, 54].
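Both advantages can be seen in a small sketch. It assumes that each data source assigns a gene pair a probability of being functionally linked and that sources are independent, so the scores can be combined as 1 − ∏(1 − p). This combination rule, the function name `integrate_edges`, and the example scores are illustrative assumptions, not the scheme used by the studies cited above.

```python
def integrate_edges(sources):
    """Combine per-source link probabilities into one co-function network.

    sources: iterable of dicts mapping (gene_a, gene_b) tuples to a probability.
    Assumes independent sources: combined = 1 - product of (1 - p) over sources.
    Returns a dict mapping frozenset({gene_a, gene_b}) to the combined probability.
    """
    miss = {}  # probability that every source seen so far fails to support the edge
    for edges in sources:
        for pair, p in edges.items():
            key = frozenset(pair)  # undirected edge: (a, b) equals (b, a)
            miss[key] = miss.get(key, 1.0) * (1.0 - p)
    return {key: 1.0 - m for key, m in miss.items()}


# Hypothetical scores from two data types.
coexpression = {("g1", "g2"): 0.5, ("g2", "g3"): 0.4}
protein_interaction = {("g2", "g1"): 0.5}
combined = integrate_edges([coexpression, protein_interaction])
```

Here the g1–g2 link, supported by both sources, ends up with a higher combined score than either source gives alone, while the g2–g3 link is retained even though only one data type detects it.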
Predicting and validating gene function
The biological processes of genes can be inferred from co-function networks either by using the enriched (statistically overrepresented) functions of network neighbors or by clustering genes and identifying the enriched or majority functions of the genes within a cluster [62, 67] (Figure 2E). Clustering techniques largely fall into three categories: hierarchical, partitioning, and density-based (reviewed in [68]). Popular algorithms include the Markov Cluster Algorithm (MCL) for its efficiency and
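Enrichment of a function among network neighbors or cluster members is commonly assessed with a one-sided hypergeometric test. The sketch below is one such test written with the Python standard library only; the function name and the example counts are hypothetical, and a real analysis would also correct for testing many terms.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """One-sided hypergeometric p-value, P(X >= hits).

    population: number of genes in the network.
    annotated: number of those genes carrying the function term.
    sample: number of genes in the neighborhood or cluster.
    hits: number of sampled genes carrying the term.
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    favorable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return favorable / total


# Hypothetical example: 3 of 10 genes carry a term, and all 3 genes in a cluster carry it.
p_value = hypergeom_enrichment_p(population=10, annotated=3, sample=3, hits=3)
```

A small p-value indicates that the term is overrepresented in the cluster relative to the whole network, which is the basis for transferring it to uncharacterized members.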
Caveats for gene function prediction
There are some caveats involved in function inference, which we will briefly discuss to help scientists identify which predictions are likely to be less reliable, and indicate the areas that are likely to be the focus of future research.
Where do we go from here?
Network-based function inference, despite having some caveats, holds great promise for accelerating gene function discovery in plants. However, there are some bottlenecks that we must overcome to achieve the goal of understanding the function of all genes in a plant genome.
Concluding remarks
Although network-based gene function prediction has been an active area of research for the past 15 years, its use in plant science has been limited. To exploit this underused technology we need: (i) more data; (ii) better assessment of data and tool quality; (iii) easy access to the data and tools; and (iv) high-throughput experimental validation. Many discoveries in plant science were made without knowing what most of the genes do. It is exciting to ponder what we will discover, those
Acknowledgments
Work in S.Y.R.’s laboratory is funded by the US National Science Foundation (grants IOS-1026003, DBI-0640769, and MCB-1052348) and the US Department of Energy (grant BER-65472). M.M. is funded by the Max Planck Institute for Molecular Plant Physiology. We thank Tanya Berardini from TAIR, Rachael Huntley from UniProt, and Marcela Monaco from Gramene for their help with accessing GO annotations, and Lee Chae and Eva Huala for their comments on the manuscript.
Glossary
- Bayesian network: a graphical representation of the conditional dependencies of nodes.
- Cluster compactness: a measure for determining the degree of similarity of nodes in a cluster.
- Cluster completeness: a measure of how many nodes with the same property are assigned to the same cluster.
- Cluster connectedness: a measure of the density of the links between nodes in a cluster.
- Cluster purity or homogeneity: a measure of the homogeneity of the characteristics of the nodes in a cluster.
- Cluster stability: a measure
References (101)
- et al.
Recent progress in protein subcellular location prediction
Anal. Biochem.
(2007)
- et al.
How well is enzyme function conserved as a function of pairwise sequence identity?
J. Mol. Biol.
(2003)
Enzyme function less conserved than anticipated
J. Mol. Biol.
(2002)
A systematic mammalian genetic interaction map reveals pathways underlying ricin susceptibility
Cell
(2013)
Network-based function prediction and interactomics: the case for metabolic enzymes
Metab. Eng.
(2011)
Evolutionary constraints on hub and non-hub proteins in human protein interaction network: insight from protein connectivity and intrinsic disorder
Gene
(2009)
The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools
Nucleic Acids Res.
(2012)
Gramene database in 2010: updates and extensions
Nucleic Acids Res.
(2011)
- et al.
Ontologies in biology: design, applications and future challenges
Nat. Rev. Genet.
(2004)
The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration
Nat. Biotechnol.
(2007)
Gene ontology: tool for the unification of biology. The Gene Ontology Consortium
Nat. Genet.
Enzyme Nomenclature
The Transporter Classification Database: recent advances
Nucleic Acids Res.
Whole-plant growth stage ontology for angiosperms and its application in plant biology
Plant Physiol.
The plant structure ontology, a unified vocabulary of anatomy and morphology of a flowering plant
Plant Physiol.
MAPMAN: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes
Plant J.
Use and misuse of the gene ontology annotations
Nat. Rev. Genet.
Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana
Nat. Biotechnol.
Systems analysis of plant functional, transcriptional, physical interaction, and metabolic networks
Plant Cell
Reverse engineering of biological complexity
Science
Global protein function prediction from protein–protein interaction networks
Nat. Biotechnol.
The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata
Nucleic Acids Res.
Current challenges in de novo plant genome sequencing and assembly
Genome Biol.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Nucleic Acids Res.
InterProScan: protein domains identifier
Nucleic Acids Res.
Analogous enzymes: independent inventions in enzyme evolution
Genome Res.
Rate and polarity of gene fusion and fission in Oryza sativa and Arabidopsis thaliana
Mol. Biol. Evol.
Genomic gene clustering analysis of pathways in eukaryotes
Genome Res.
From hormones to secondary metabolism: the emergence of metabolic gene clusters in plants
Plant J.
Assigning protein functions by comparative genome analysis: protein phylogenetic profiles
Proc. Natl. Acad. Sci. U.S.A.
Synergistic use of plant–prokaryote comparative genomics for functional annotations
BMC Genomics
Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters
Nat. Genet.
SLocX: predicting subcellular localization of Arabidopsis proteins leveraging gene expression data
Front. Plant Sci.
Identification of genes required for cellulose synthesis by regression analysis of public microarray data sets
Proc. Natl. Acad. Sci. U.S.A.
Co-expression analysis identifies CRC and AP1 the regulator of Arabidopsis fatty acid biosynthesis
J. Integr. Plant Biol.
Prephenate aminotransferase directs plant phenylalanine biosynthesis via arogenate
Nat. Chem. Biol.
Gene coexpression network alignment and conservation of gene modules between two grass species: maize and rice
Plant Physiol.
Comparative network analysis reveals that tissue specificity and gene function are important factors influencing the mode of expression evolution in Arabidopsis and rice
Plant Physiol.
BAR expressolog identification: expression profile similarity ranking of homologous genes in plant species
Plant J.
Comparative study of RNA-seq- and microarray-derived coexpression networks in Arabidopsis thaliana
Bioinformatics
Proteome survey reveals modularity of the yeast cell machinery
Nature
Proteome-wide protein interaction measurements of bacterial proteins of unknown function
Proc. Natl. Acad. Sci. U.S.A.
Assessment of prediction accuracy of protein function from protein–protein interaction data
Yeast
The BioGRID Interaction Database: 2011 update
Nucleic Acids Res.
Where have all the interactions gone? Estimating the coverage of two-hybrid protein interaction maps
PLoS Comput. Biol.
How complete are current yeast and human protein-interaction networks?
Genome Biol.
High-quality binary protein interaction map of the yeast interactome network
Science
Rewiring of genetic networks in response to DNA damage
Science
Genetic interaction maps in Escherichia coli reveal functional crosstalk among cell envelope biogenesis pathways
PLoS Genet.
Cited by (177)
Charting plant gene functions in the multi-omics and single-cell era
2023, Trends in Plant Science
Crossroads in the evolution of plant specialized metabolism
2023, Seminars in Cell and Developmental Biology
Functional genomics of Chlamydomonas reinhardtii
2023, The Chlamydomonas Sourcebook: Volume 1: Introduction to Chlamydomonas and Its Laboratory Use
Ex vivo metabolomics—A hypothesis-free approach to identify native substrate(s) and product(s) of orphan enzymes
2023, Methods in Enzymology
Citation Excerpt: For instance, roughly 20% of uncharacterized proteins in Saccharomyces cerevisiae are predicted to be enzymes (Cohen, Kahana, & Schuldiner, 2022). Similarly, approximately 60% of protein coding genes in the model plant Arabidopsis thaliana are still lacking functional annotation based on experimental evidence (Rhee & Mutwil, 2014). Despite the highly developed molecular toolboxes available for the study of many of these model organisms, some of their enzymes have eluded characterization so far.
Systems and strategies for plant protein expression
2023, Methods in Enzymology