Trends in Biotechnology
Review
Protein function prediction – the power of multiplicity
Introduction
Improving the coverage and accuracy of functional annotation has now become more important than the production of yet more ‘raw’ sequence data. The new driving forces in the life sciences, systems and medical biology, will require a complete picture of the cellular functional repertoire to ultimately link and alter micro- and macroscopic phenotypes.
The present multiplicity of data, methods and definitions of what ‘protein function’ actually means has equal potential to help and confuse the annotation process. To illustrate the former and help prevent the latter, this review tries to provide an ordered picture of the automated function prediction field.
We focus on accessible, competitive and actively maintained resources useful in the prediction of protein function (Table 1). Although a great many resources are suitable for predicting molecular function (covered in the first part of the article), tools specifically aimed at predicting broader (process and pathway) functions are still very sparse. The second part of the article therefore also points to concepts and algorithms that could form the basis for such new resources. Owing to space limitations, we cannot cover the exciting fields of function inference from structure (see e.g. [1, 2]) and prediction of subcellular localization (see [3]).
Functional annotation – how to capture it?
An important prerequisite for making maximum use of sequence and associated experimental data is a system of storing functional information in a way that is easy to interpret by human beings and consistent enough for large-scale computational studies. Several such systems have been developed, focusing on different aspects of protein function (Box 1).
In 2000, Rison and co-workers [4] carried out a comparison of existing annotation systems and concluded: ‘The power of such schemes will only be
Molecular function – reactions, substrates and activities
Traditionally, ‘protein function’ refers to the molecular function of a sequence, such as the catalytic activity of enzymes, the scaffolding activity of structural (e.g. cytoskeletal) proteins, the transport and signalling activities of transmembrane proteins, and so forth. This narrow aspect of function is solely determined by sequence and structure and corresponds to the ‘molecular function’ branch of GO (Box 1).
The easiest way to infer the molecular function of an uncharacterized sequence is
Similarity group methods
The results of a Basic Local Alignment Search Tool (BLAST) [8] search against public databases can provide a picture of how big or diverse (by the number and quality of hits) and how well characterized (by the descriptions of the hits) the family of an uncharacterized query sequence is. It seems intuitive that more, and more correct, annotations could be transferred by looking at all relatives in this list – not only the top hit. The dramatic increase in sequence data (hence more matches for
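The idea of pooling annotations over all hits, rather than copying them from the top hit, can be sketched as follows. This is only a minimal illustration under stated assumptions, not the scoring scheme of any particular server: the function name `score_terms`, the weighting of each hit by -log10(E-value) and the example hit list are all hypothetical.

```python
import math
from collections import defaultdict


def score_terms(hits, min_evalue=1e-180):
    """Score annotation terms over all hits of a similarity search.

    hits: iterable of (evalue, terms) pairs, where terms is a set of
    annotation strings attached to that hit.
    Assumption: each hit votes with weight -log10(E-value); scores are
    normalised by the total weight, so they fall between 0 and 1.
    """
    scores = defaultdict(float)
    total_weight = 0.0
    for evalue, terms in hits:
        # Clamp very small E-values (including 0.0) to avoid log10(0).
        weight = -math.log10(max(evalue, min_evalue))
        if weight <= 0:
            continue  # hits with E-value >= 1 carry no support
        total_weight += weight
        for term in terms:
            scores[term] += weight
    if total_weight == 0:
        return {}
    return {term: s / total_weight for term, s in scores.items()}


# Hypothetical hit list for an uncharacterized query sequence.
hits = [
    (1e-50, {"kinase activity", "ATP binding"}),
    (1e-30, {"kinase activity"}),
    (1e-20, {"ATP binding", "transferase activity"}),
]
term_scores = score_terms(hits)
ranked = sorted(term_scores.items(), key=lambda kv: kv[1], reverse=True)
```

In this toy example, a term supported by several strong hits outranks a term carried only by a weaker hit, which is the intuition behind looking beyond the top hit.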
The phylogenomic approach
The accuracy of annotation transfer can be increased further by taking the evolutionary relationships within protein families into account. This addresses the difference between orthologous and paralogous relatives of a query sequence, that is, between relatives by speciation and relatives by gene duplication [12] (see Glossary).
The phylogenomic method [13] implements the following basic workflow: find all homologues of the query sequence and align them, build a phylogenetic tree, reconcile
Pattern-based methods
The methods mentioned above largely depend on whole-sequence homologues in public databases. Alternatively, several resources classify proteins by locally conserved sequence patterns, which often indicate the function(s) of the whole protein (e.g. active site motifs). Figure 1 provides an overview of the different types of patterns and respective resources. These focus on different levels of functional specificity, as reflected by pattern size and complexity. Note that protein domains are
Clustering approaches
Several resources try to cluster the known sequence space, whereby uncharacterized sequences can be functionally annotated by virtue of their clustering with characterized sequences. There are two principal types of this approach: clustering based on sequence similarity alone (homologues), and clustering based on supposed functional conservation (orthologues and inparalogues).
ProtoNet [38] is an ambitious representative of the first type, organized as a tree of sequence clusters. Using a
Machine learning methods
If a sequence of interest turns out to be an ‘ORFan’ (see Glossary), feature-based machine learning (ML) methods can sometimes provide valuable functional hints. Such approaches try to learn characteristic combinations of sequence features (or their intensities) that co-occur with specific functional assignments in a training set of known sequences. Classifiers constructed in this way, usually Support Vector Machines (SVMs) or neural networks, are then used to assign different functions to uncharacterized
So which server wins?
Figure 2 shows the annotation workflows of molecular function prediction methods. Clustering and similarity group methods rely only on one-to-one sequence comparisons, whereas pattern and phylogenomics approaches involve a multiple sequence alignment step. ML servers train their classifiers on intrinsic sequence features rather than comparing sequences (dashed arrow in Figure 2). These differences translate to different strengths and weaknesses.
A sensible approach to molecular function prediction
Broader function – interactions, complexes and pathways
Protein–protein interaction (PPI) datasets are becoming increasingly rich and useful in delineating the biological processes, pathways and complexes that proteins take part in. There is now observable overlap (and informative deviation) in results between different types of low- and high-throughput experiments [55].
Experimental evidence is not the only way to increase confidence in the now omnipresent ‘fur balls’ of cellular networks. So-called ‘genomic context’ methods [56] are sequence-based,
Genomic context methods
Several computational approaches try to predict protein interactions from sequence data alone: gene fusion and gene neighbourhood analyses, phylogenetic profiling (PP) and tree-similarity methods. These vary in scope and specificity [58], and their combination yields the most promising results [59].
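As an illustration of one of these approaches, phylogenetic profiling is commonly described as recording in which genomes a protein has a detectable homologue and linking proteins whose presence/absence patterns agree. The sketch below makes several assumptions that are not taken from the article: profiles are held as sets of genome identifiers, agreement is measured with the Jaccard index, and the threshold of 0.8 and all names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def co_occurring_pairs(profiles, threshold=0.8):
    """Return protein pairs whose phylogenetic profiles are similar.

    profiles: dict mapping a protein name to the set of genomes in
    which a homologue was detected (assumed to be precomputed).
    """
    pairs = []
    for name_a, name_b in combinations(sorted(profiles), 2):
        similarity = jaccard(profiles[name_a], profiles[name_b])
        if similarity >= threshold:
            pairs.append((name_a, name_b, similarity))
    return pairs


# Hypothetical profiles over four genomes g1..g4.
profiles = {
    "protA": {"g1", "g2", "g4"},
    "protB": {"g1", "g2", "g4"},
    "protC": {"g2", "g3"},
}
links = co_occurring_pairs(profiles)
```

Here only protA and protB share an identical profile and are therefore proposed as functionally linked; a real analysis would need to account for the phylogenetic relatedness of the genomes compared.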
The gene fusion (also known as ‘Rosetta Stone’) approach tries to detect instances of a relatively rare evolutionary event: the fusion of genes of closely interacting proteins, resulting in a single
Network-based annotation transfer
Any interaction dataset, experimentally derived or predicted by the above methods, can be depicted as a network of nodes connected by edges. Nodes are cellular ‘players’ (such as proteins) and edges denote interactions between nodes.
Network-based annotation transfer algorithms share the assumption that the closer two nodes are in the network, the more functionally similar they will be. This refers to broader function (cellular pathway or process), as opposed to molecular function (as in
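The simplest reading of this assumption is a vote among direct interaction partners. The following sketch shows that idea only; it is not the algorithm of any specific published tool, and the adjacency-map representation, the function name `neighbour_vote` and the example network are assumptions made for illustration.

```python
from collections import Counter


def neighbour_vote(network, annotations, node):
    """Rank candidate functions for `node` by counting how many of its
    direct neighbours carry each annotation.

    network: dict mapping each node to the set of its neighbours.
    annotations: dict mapping annotated nodes to sets of process terms.
    Returns a list of (term, count) pairs, most frequent first.
    """
    votes = Counter()
    for neighbour in network.get(node, ()):
        votes.update(annotations.get(neighbour, ()))
    return votes.most_common()


# Hypothetical network: X is uncharacterized, A, B and C are annotated.
network = {
    "X": {"A", "B", "C"},
    "A": {"X", "B"},
    "B": {"X", "A"},
    "C": {"X"},
}
annotations = {
    "A": {"DNA repair"},
    "B": {"DNA repair", "cell cycle"},
    "C": {"translation"},
}
predictions = neighbour_vote(network, annotations, "X")
```

In this example ‘DNA repair’ receives two votes and ranks first. More elaborate schemes could weight votes by network distance or edge confidence, but those choices go beyond what this sketch assumes.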
Functional linkage networks
Individual PPI datasets usually still yield sparse and unreliable networks [78]. The integration of different types of interaction data into functional linkage networks (FLNs) was first demonstrated by Marcotte and colleagues in 1999 [56] and has since become a promising approach (Figure 3).
FLN edge weights are integrated interaction probability values. This is thought to increase both coverage and precision: in some cases, several experimental and/or computational methods will support the same weak (true positive) link
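How several weak lines of evidence can add up to a more confident link can be illustrated with a simple combination rule. The article does not specify the integration scheme; the sketch below assumes, purely for illustration, that each evidence source gives an independent probability that the link is real and combines them as 1 - ∏(1 - p_i). The function name and the example values are hypothetical.

```python
import math


def integrate_evidence(probabilities):
    """Combine per-source link probabilities into one edge weight.

    Assumption: sources are independent, so the link is missed only if
    every source misses it: p = 1 - prod(1 - p_i).
    """
    probabilities = list(probabilities)
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return 1.0 - math.prod(1.0 - p for p in probabilities)


# Hypothetical link supported weakly by three different methods.
edge_weight = integrate_evidence([0.3, 0.4, 0.5])
```

Under this assumption, three individually weak scores of 0.3, 0.4 and 0.5 combine to an edge weight of about 0.79. Real evidence sources are rarely independent, so the combined value should be read as an optimistic estimate.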
Concluding remarks
The combination of sequence- and network-based function-transfer approaches, as discussed here, is a promising field for future studies. This is due to their complementary nature: while sequence (and structural) similarity can provide a safe basis for molecular function transfer, interactions hint at the pathways and the processes in which uncharacterized proteins participate [84].
Although there is a need for network-based annotation tools and servers, the improvement of sequence-based methods
Acknowledgements
R.R. was funded by the European Union ENFIN (Experimental Network for Functional Integration) Network of Excellence.
Glossary
- Inparalogues
- paralogues in the same genome that arose from a gene duplication event after speciation. Owing to a lack of time to diverge, inparalogues are thought to be functionally more similar than outparalogues and can be seen as ‘co-orthologues’ to a single orthologous sequence in another genome (as explained in [12]).
- ORFan sequences
- proteins or protein-encoding regions in a genome that have no detectable sequence similarity to proteins found in other genomes.
- Orthologues
- homologous genes in
References (90)
- Exploring the structure and function paradigm. Curr. Opin. Struct. Biol. (2008)
- et al. Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet. (2002)
- et al. How well is enzyme function conserved as a function of pairwise sequence identity? J. Mol. Biol. (2003)
- Enzyme function less conserved than anticipated. J. Mol. Biol. (2002)
- et al. EzyPred: a top-down approach for predicting enzyme functional classes and subclasses. Biochem. Biophys. Res. Commun. (2007)
- et al. Comparative interactomics: comparing apples and pears? Trends Biotechnol. (2007)
- et al. Structure-based function prediction: approaches and applications. Brief. Funct. Genomic. Proteomic. (2008)
- The prediction of protein subcellular localization from sequence: a shortcut to functional genome annotation. Brief. Funct. Genomic. Proteomic. (2008)
- Comparison of functional annotation schemes for genomes. Funct. Integr. Genomics (2000)
- Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. (2000)