Review
Protein function prediction – the power of multiplicity

https://doi.org/10.1016/j.tibtech.2009.01.002

Advances in experimental and computational methods have quietly ushered in a new era in protein function annotation. This ‘age of multiplicity’ is marked by the notion that only the use of multiple tools and multiple lines of evidence, together with consideration of the multiple aspects of function, can give us the broad picture that 21st century biology will need to link and alter micro- and macroscopic phenotypes. It might also help us to undo past mistakes by removing errors from our databases and prevent us from producing more. On the downside, multiplicity is often confusing. We therefore systematically review methods and resources for automated protein function prediction, looking at individual (biochemical) and contextual (network) functions, respectively.

Introduction

Improving the coverage and accuracy of functional annotation has now become more important than the production of yet more ‘raw’ sequence data. The new driving forces in the life sciences, systems and medical biology, will require a complete picture of the cellular functional repertoire to ultimately link and alter micro- and macroscopic phenotypes.

The present multiplicity of data, methods and definitions of what ‘protein function’ actually means has equal potential to help and confuse the annotation process. To illustrate the former and help prevent the latter, this review tries to provide an ordered picture of the automated function prediction field.

We focus on accessible, competitive and actively maintained resources useful in the prediction of protein function (Table 1). Although there are a great many resources suitable for predicting molecular function (covered in the first part of the article), tools specifically aimed at the prediction of broader (process and pathway) functions are still very scarce. The second part of the article therefore also points to concepts and algorithms that could form the basis for such new resources. Owing to space limitations, we cannot cover the exciting fields of function inference from structure (see e.g. [1,2]) and prediction of subcellular localization (see [3]).

Functional annotation – how to capture it?

An important prerequisite for making maximum use of sequence and associated experimental data is a system of storing functional information in a way that is easy to interpret by human beings and consistent enough for large-scale computational studies. Several such systems have been developed, focusing on different aspects of protein function (Box 1).

In 2000, Rison and co-workers [4] carried out a comparison of existing annotation systems and concluded: ‘The power of such schemes will only be

Molecular function – reactions, substrates and activities

Traditionally, ‘protein function’ refers to the molecular function of a sequence, such as the catalytic activity of enzymes, the scaffolding activity of structural (e.g. cytoskeletal) proteins, the transport and signalling activities of transmembrane proteins, and so forth. This narrow aspect of function is solely determined by sequence and structure and corresponds to the ‘molecular function’ branch of GO (Box 1).

The easiest way to infer the molecular function of an uncharacterized sequence is

Similarity group methods

The results of a Basic Local Alignment Search Tool (BLAST) [8] search against public databases can provide a picture of how big or diverse (by the number and quality of hits) and how well characterized (by the descriptions of the hits) the family of an uncharacterized query sequence is. It seems intuitive that more, and more correct, annotations could be transferred by looking at all relatives in this list – not only the top hit. The dramatic increase in sequence data (hence more matches for
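The idea of pooling annotations over all hits, rather than copying the top hit, can be illustrated with a minimal sketch. It assumes the BLAST output has already been parsed into (E-value, annotation terms) tuples; the function name `score_terms`, the E-value cut-off and the weighting by -log10(E) are illustrative assumptions, not the scoring scheme of any particular server, and the term labels are placeholders.

```python
import math
from collections import defaultdict

def score_terms(hits, max_evalue=1e-3):
    """Pool annotation terms over all hits, weighting each hit by -log10(E).

    `hits` is a list of (evalue, set_of_terms) tuples (assumed, pre-parsed input).
    """
    scores = defaultdict(float)
    for evalue, terms in hits:
        if evalue > max_evalue:
            continue  # ignore hits that are too weak to be informative
        weight = -math.log10(max(evalue, 1e-300))  # clamp to avoid log10(0)
        for term in terms:
            scores[term] += weight
    # Highest cumulative support first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy input: the top hit carries TERM_A, two weaker hits agree on TERM_B.
hits = [
    (1e-50, {"TERM_A"}),
    (1e-30, {"TERM_B"}),
    (1e-30, {"TERM_B"}),
    (1e-1, {"TERM_C"}),  # above the cut-off, so it is ignored
]
ranked = score_terms(hits)
```

In this toy case the term supported by two independent relatives outranks the term carried only by the top hit, which is the intuition behind looking at the whole hit list.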

The phylogenomic approach

The accuracy of annotation transfer can be increased further by taking the evolutionary relationships within protein families into account. This addresses the difference between orthologous and paralogous relatives of a query sequence, that is, between relatives by speciation and relatives by gene duplication [12] (see Glossary).
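The distinction between relatives by speciation and relatives by gene duplication can be made concrete with a small sketch. It assumes a gene tree that has already been reconciled with the species tree, so that every internal node is labelled as a speciation or duplication event; the dictionary layout and the name `classify_pair` are hypothetical choices made for illustration only.

```python
def classify_pair(tree, a, b):
    """Classify two leaves of a reconciled gene tree.

    `tree` maps node -> (parent, event); event is 'speciation' or
    'duplication' for internal nodes and None for leaves (assumed layout).
    """
    def ancestors(node):
        path = []
        parent = tree[node][0]
        while parent is not None:
            path.append(parent)
            parent = tree[parent][0]
        return path

    ancestors_b = set(ancestors(b))
    # The first shared ancestor on the way to the root is the last common ancestor
    lca = next(node for node in ancestors(a) if node in ancestors_b)
    return "orthologues" if tree[lca][1] == "speciation" else "paralogues"

# Toy tree: one ancient duplication (D) followed by a speciation in each copy.
tree = {
    "D": (None, "duplication"),
    "S1": ("D", "speciation"),
    "S2": ("D", "speciation"),
    "human_A": ("S1", None),
    "mouse_A": ("S1", None),
    "human_B": ("S2", None),
    "mouse_B": ("S2", None),
}
same_copy = classify_pair(tree, "human_A", "mouse_A")
across_copies = classify_pair(tree, "human_A", "mouse_B")
```

Here `human_A` and `mouse_A` diverged at a speciation node and are treated as orthologues, whereas `human_A` and `mouse_B` trace back to the duplication and are treated as paralogues.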

The phylogenomic method [13] implements the following basic workflow: find all homologues of the query sequence and align them, build a phylogenetic tree, reconcile

Pattern-based methods

The methods mentioned above largely depend on whole-sequence homologues in public databases. Alternatively, several resources classify proteins by locally conserved sequence patterns, which often indicate the function(s) of the whole protein (e.g. active site motifs). Figure 1 provides an overview of the different types of patterns and respective resources. These focus on different levels of functional specificity, as reflected by pattern size and complexity. Note that protein domains are
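As an illustration of the simplest pattern type, a short motif written in PROSITE-style notation can be translated into a regular expression and scanned against a sequence. The sketch below handles only the common syntax elements (residue sets, exclusions, `x` wildcards, repeat counts and terminal anchors) and is not a substitute for the official scanning tools; the function name `prosite_to_regex` and the toy sequence are assumptions made for this example.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Simplified: anchors inside square brackets are not supported.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("x", ".")                     # x -> any residue
        element = element.replace("(", "{").replace(")", "}")   # (2,4) -> {2,4}
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)

# P-loop (Walker A) motif in PROSITE-style notation
p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
match = re.search(p_loop, "MTAPNVGGKSTL")  # toy sequence, not a real protein
```

A hit of such a short pattern is only a hint; longer and more complex patterns (profiles, hidden Markov models) are needed for more specific functional assignments.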

Clustering approaches

Several resources try to cluster the known sequence space, whereby uncharacterized sequences can be functionally annotated by virtue of their clustering with characterized sequences. There are two principal types of this approach: clustering based on sequence similarity alone (homologues), and clustering based on supposed functional conservation (orthologues and inparalogues).
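The first type of approach can be sketched as flat single-linkage clustering over pairwise similarities, followed by annotation transfer within unambiguous clusters. This is a deliberately simplified illustration: the similarity scores, the threshold and the names `single_linkage` and `transfer` are assumptions, and real resources build hierarchical clusterings rather than a single flat partition.

```python
def single_linkage(pairs, threshold):
    """Group sequences connected by any chain of similarities >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b, similarity in pairs:
        root_a, root_b = find(a), find(b)  # registers both sequences
        if similarity >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for seq in list(parent):
        clusters.setdefault(find(seq), set()).add(seq)
    return list(clusters.values())

def transfer(clusters, known):
    """Annotate unknown members of clusters whose known members all agree."""
    predicted = {}
    for cluster in clusters:
        labels = {known[s] for s in cluster if s in known}
        if len(labels) == 1:
            label = next(iter(labels))
            for s in cluster:
                if s not in known:
                    predicted[s] = label
    return predicted

# Toy data: (sequence, sequence, similarity in [0, 1]); all names are made up.
pairs = [("q1", "p1", 0.8), ("p1", "p2", 0.7), ("q2", "p3", 0.2)]
known = {"p1": "kinase", "p2": "kinase", "p3": "protease"}
clusters = single_linkage(pairs, threshold=0.5)
predicted = transfer(clusters, known)
```

The query `q1` inherits the annotation of its cluster, whereas `q2` stays unannotated because its only similarity falls below the threshold.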

ProtoNet [38] is an ambitious representative of the first type, organized as a tree of sequence clusters. Using a

Machine learning methods

If a sequence of interest turns out to be an ‘ORFan’ (see Glossary), feature-based machine learning (ML) methods can sometimes provide valuable functional hints. Such approaches try to learn characteristic combinations of sequence features (or their intensities) that co-occur with specific functional assignments in a training set of known sequences. Classifiers constructed in this way, usually support vector machines (SVMs) or neural networks, are then used to assign different functions to uncharacterized
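The principle of learning from intrinsic sequence features, without any sequence comparison, can be shown with a toy example. Note the substitutions made for brevity: the only feature used is amino acid composition, and a nearest-centroid classifier stands in for the SVMs or neural networks used by real servers. All sequences, class labels and function names are invented for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Feature vector: fraction of each of the 20 standard amino acids."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def train_centroids(examples):
    """Average the feature vectors of the training sequences of each class."""
    sums, counts = {}, {}
    for seq, label in examples:
        vector = composition(seq)
        total = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total]
            for label, total in sums.items()}

def predict(centroids, seq):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    vector = composition(seq)

    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))

    return min(centroids, key=lambda label: distance(centroids[label]))

# Toy training set with two made-up classes
examples = [
    ("LLLLVVVVAAII", "hydrophobic_class"),
    ("AILVLLFFIV", "hydrophobic_class"),
    ("KKDDEEKRDE", "charged_class"),
    ("EEKKRRDDEK", "charged_class"),
]
centroids = train_centroids(examples)
prediction = predict(centroids, "LLVVIIAALF")
```

Because the classifier never aligns the query to anything, it can still produce a prediction for a sequence with no detectable homologues, although the reliability of such predictions depends entirely on the training set and feature choice.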

So which server wins?

Figure 2 shows the annotation workflows of molecular function prediction methods. Clustering and similarity group methods rely only on one-to-one sequence comparisons, whereas pattern-based and phylogenomic approaches involve a multiple sequence alignment step. ML servers train their classifiers on intrinsic sequence features, without comparing sequences (dashed arrow in Figure 2). These differences translate to different strengths and weaknesses.

A sensible approach to molecular function prediction

Broader function – interactions, complexes and pathways

Protein–protein interaction (PPI) datasets are becoming increasingly rich and useful in delineating the biological processes, pathways and complexes that proteins take part in. There is now observable overlap (and informative deviation) in results between different types of low- and high-throughput experiments [55].

Experimental evidence is not the only means of increasing confidence in the now omnipresent ‘fur balls’ of cellular networks. So-called ‘genomic context’ methods [56] are sequence-based,

Genomic context methods

Several computational approaches try to predict protein interactions from sequence data alone: gene fusion and gene neighbourhood analyses, phylogenetic profiling (PP) and tree-similarity methods. These vary in scope and specificity [58], and their combination yields the most promising results [59].
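Of these approaches, phylogenetic profiling is the easiest to sketch: each protein is represented by the set of genomes in which it occurs, and proteins with similar presence/absence patterns are proposed to be functionally linked. The use of Jaccard similarity, the threshold and all names below are illustrative assumptions rather than the measure used by any specific resource.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity of two profiles (sets of genomes containing the gene)."""
    union = profile_a | profile_b
    if not union:
        return 0.0
    return len(profile_a & profile_b) / len(union)

def linked_pairs(profiles, min_similarity=0.8):
    """Return protein pairs whose phylogenetic profiles are similar enough."""
    names = sorted(profiles)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if jaccard(profiles[a], profiles[b]) >= min_similarity
    ]

# Toy profiles: which (made-up) genomes contain each protein
profiles = {
    "protA": {"genome1", "genome2", "genome4"},
    "protB": {"genome1", "genome2", "genome4"},
    "protC": {"genome3", "genome5"},
}
links = linked_pairs(profiles)
```

In practice the number and phylogenetic spread of the genomes strongly influence how informative such profiles are, which is one reason why combining several context methods is attractive.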

The gene fusion (also known as ‘Rosetta Stone’) approach tries to detect instances of a relatively rare evolutionary event: the fusion of genes of closely interacting proteins, resulting in a single

Network-based annotation transfer

Any interaction dataset, experimentally derived or predicted by the above methods, can be depicted as a network of nodes connected by edges. Nodes are cellular ‘players’ (such as proteins) and edges denote interactions between nodes.

Network-based annotation transfer algorithms share the assumption that the closer two nodes are in the network, the more functionally similar they will be. This refers to broader function (cellular pathway or process), as opposed to molecular function (as in
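The simplest instance of this ‘guilt by association’ assumption is a majority vote among the direct neighbours of an uncharacterized node. The sketch below is a minimal illustration on an unweighted edge list; the function name `neighbour_vote` and the toy network are hypothetical, and more elaborate algorithms also consider indirect neighbours and edge weights.

```python
from collections import Counter

def neighbour_vote(edges, annotations, query):
    """Predict the process of `query` as the most common one among its neighbours."""
    neighbours = set()
    for a, b in edges:
        if a == query:
            neighbours.add(b)
        elif b == query:
            neighbours.add(a)
    votes = Counter(annotations[n] for n in neighbours if n in annotations)
    # No annotated neighbour means no prediction
    return votes.most_common(1)[0][0] if votes else None

# Toy network: nodes are proteins, edges are interactions
edges = [("q", "p1"), ("q", "p2"), ("q", "p3"), ("p3", "p4")]
annotations = {
    "p1": "DNA repair",
    "p2": "DNA repair",
    "p3": "translation",
    "p4": "translation",
}
prediction = neighbour_vote(edges, annotations, "q")
```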

Functional linkage networks

Individual PPI datasets still usually yield sparse and unreliable networks [78]. The integration of different types of interaction data into functional linkage networks (FLNs) was first demonstrated by Marcotte and colleagues in 1999 [56] and has since become a promising approach (Figure 3).

FLN edge weights are integrated interaction probability values. This is thought to increase both coverage and precision: in some cases, several experimental and/or computational methods will support the same weak (true positive) link
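Why several weak lines of evidence can add up to a confident link is easy to see with a noisy-OR combination of per-source probabilities. This is only one possible integration scheme, shown here under the strong assumption that the evidence sources are independent; it is not presented as the method of any FLN resource, and the function name `combine` and the example values are invented.

```python
def combine(probabilities):
    """Noisy-OR: probability that at least one independent source is correct."""
    remaining = 1.0
    for p in probabilities:
        remaining *= 1.0 - p  # probability that every source so far is wrong
    return 1.0 - remaining

# Three weak, assumed-independent lines of evidence for the same link
edge_weight = combine([0.3, 0.4, 0.5])
```

With these toy values the combined weight is 1 - 0.7 × 0.6 × 0.5 = 0.79, higher than any single source, whereas a link supported by only one weak source keeps its low weight.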

Concluding remarks

The combination of sequence- and network-based function-transfer approaches, as discussed here, is a promising field for future studies. This is due to their complementary nature: while sequence (and structural) similarity can provide a safe basis for molecular function transfer, interactions hint at the pathways and the processes in which uncharacterized proteins participate [84].

Although there is a need for network-based annotation tools and servers, the improvement of sequence-based methods

Acknowledgements

R.R. was funded by the European Union ENFIN (Experimental Network for Functional Integration) Network of Excellence.

Glossary

Inparalogues
paralogues in the same genome that arose from a gene duplication event after speciation. Owing to a lack of time to diverge, inparalogues are thought to be functionally more similar than outparalogues and can be seen as ‘co-orthologues’ to a single orthologous sequence in another genome (as explained in [12]).
ORFan sequences
proteins or protein-encoding regions in a genome that have no detectable sequence similarity to proteins found in other genomes.
Orthologues
homologous genes in

References (90)

  • O. Bodenreider

    The Unified Medical Language System (UMLS): integrating biomedical terminology

    Nucleic Acids Res.

    (2004)
  • O. Bodenreider

    Biomedical ontologies in action: role in knowledge management, data integration and decision support

    Yearb Med. Inform.

    (2008)
  • S.F. Altschul

    Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

    Nucleic Acids Res.

    (1997)
  • D.M. Martin

    GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes

    BMC Bioinformatics

    (2004)
  • T. Hawkins

    PFP: automated prediction of gene ontology functional annotations with confidence scores using protein sequence data

    Proteins

    (2008)
  • C.E. Jones

    GOSLING: a rule-based protein annotator using BLAST and GO

    Bioinformatics

    (2008)
  • J.A. Eisen

    A phylogenomic study of the MutS family of proteins

    Nucleic Acids Res.

    (1998)
  • M. Goodman

    Fitting the gene lineage into its species lineage. A parsimony strategy illustrated by cladograms constructed from globin sequences

    Syst. Zool.

    (1979)
  • B.E. Engelhardt

    Protein molecular function prediction by Bayesian phylogenomics

    PLOS Comput. Biol.

    (2005)
  • R.D. Finn

    The Pfam protein families database

    Nucleic Acids Res.

    (2008)
  • A. Jocker

    Protein function prediction and annotation in an integrated environment powered by web services (AFAWE)

    Bioinformatics

    (2008)
  • A. Godzik

    Computational protein function prediction: are we making progress?

    Cell. Mol. Life Sci.

    (2007)
  • G.A. Reeves

    The Protein Feature Ontology: a tool for the unification of protein feature annotations

    Bioinformatics

    (2008)
  • N.J. Mulder

    In silico characterization of proteins: UniProt, InterPro and Integr8

    Mol. Biotechnol.

    (2008)
  • N. Hulo

    The 20 years of PROSITE

    Nucleic Acids Res.

    (2008)
  • T.K. Attwood

    PRINTS and its automatic supplement, prePRINTS

    Nucleic Acids Res.

    (2003)
  • D. Wilson

    The SUPERFAMILY database in 2007: families and functions

    Nucleic Acids Res.

    (2007)
  • C. Bru

    The ProDom database of protein domain families: more emphasis on 3D

    Nucleic Acids Res.

    (2005)
  • I. Letunic

    SMART 5: domains in the context of genomes and networks

    Nucleic Acids Res.

    (2006)
  • C. Yeats

    Gene3D: comprehensive structural and functional annotation of genomes

    Nucleic Acids Res.

    (2008)
  • H. Mi

    The PANTHER database of protein families, subfamilies, functions and pathways

    Nucleic Acids Res.

    (2005)
  • C.H. Wu

    PIRSF: family classification system at the Protein Information Resource

    Nucleic Acids Res.

    (2004)
  • D.H. Haft

    The TIGRFAMs database of protein families

    Nucleic Acids Res.

    (2003)
  • A. Andreeva

    Data growth and its impact on the SCOP database: new developments

    Nucleic Acids Res.

    (2008)
  • A.L. Cuff

    The CATH classification revisited – architectures reviewed and new ways to characterize structural divergence in superfamilies

    Nucleic Acids Res.

    (2008)
  • S. Addou

    Domain-based and family-specific sequence identity thresholds increase the levels of reliable...

    (2008)
  • C. Yu

    Genome-wide enzyme annotation with precision control: catalytic families (CatFam) databases

    Proteins

    (2008)
  • A.K. Arakaki

    High precision multi-genome scale reannotation of enzyme function by EFICAz

    BMC Genomics

    (2006)
  • C. Claudel-Renard

    Enzyme-specific profiles for genome annotation: PRIAM

    Nucleic Acids Res.

    (2003)
  • N. Kaplan

    ProtoNet 4.0: a hierarchical classification of one million protein sequences

    Nucleic Acids Res.

    (2005)
  • Y. Loewenstein

    Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space

    Bioinformatics

    (2008)
  • O. Sasson

    Functional annotation prediction: all for one and one for all

    Protein Sci.

    (2006)
  • R. Petryszak

    The predictive power of the CluSTr database

    Bioinformatics

    (2005)
  • P.J. Kersey

    The International Protein Index: an integrated database for proteomics experiments

    Proteomics

    (2004)
  • L.J. Jensen

    eggNOG: automated construction and annotation of orthologous groups of genes

    Nucleic Acids Res.

    (2008)