Journal of Molecular Biology
Prediction of Human Protein Function from Post-translational Modifications and Localization Features
Introduction
Out of the 35,000 to 50,000 genes believed to be present in the human genome, no more than 40–60% can be assigned a functional role based on homology to proteins with known function.1,2 Traditionally, protein function has been related directly to the three-dimensional structure of the polypeptide chain, which is currently quite hard to compute for an arbitrary sequence.3 The method presented here operates in the “feature” space of all sequences, and is therefore complementary to methods that are based on alignment and the inherent, position-by-position quantification of similarity between two sequences. The method does not require knowledge of gene expression,4 gene fusion and/or phylogenetic profiles.5,6,7,8 Although the latter type of method does not rely on finding direct matches to proteins of known function, it does require sequence similarity to other candidates that can be phylogenetically linked to a protein of known function.
For any function assignment method, the ability to predict function correctly depends strongly on the function classification scheme used. One would not expect, for example, a method based on co-regulation to work well for a category like “enzyme”, since enzymes are often strongly co-regulated with the genes coding for their substrates or substrate transporters, which need not be enzymes themselves. A similar argument holds true for the phylogenetic profile method.
Our approach to function prediction is based on the fact that a protein is not alone when performing its biological task. It has to operate using the same cellular machinery for modification and sorting as all the other proteins do. Essential types of post-translational modifications (PTMs) include N- and O-glycosylation, (S/T/Y) phosphorylation, and cleavage of N-terminal signal peptides controlling the entry to the secretory pathway, but hundreds of other types of modifications exist,9 only a subset of which will be present in any given organism. Many of the PTMs are enabled by local consensus sequence motifs, while others are characterized by more complex patterns of correlation between the amino acids.10
This suggests an alternative approach for function prediction, as one may expect that proteins performing similar functions would share some attributes even though they are not at all related at the global level of primary structure. As several predictive methods for PTMs have been constructed (R.G., S.B., unpublished results),10,11,12,13 a function prediction method based on such attributes can be applied to all proteins where the sequence is known.
Results and Discussion
The ProtFun method described here integrates (using a neural network approach) 14 individual attribute predictions and calculated sequence statistics (out of more than 25 tested for discriminative value). The integrated method predicts functional categories as defined originally by Riley for Escherichia coli, which in modified form have been used to describe many entire genomes in recent publications.1,2,14,15 In addition, it predicts whether a sequence is likely to function as an enzyme,
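To make the idea of integrating attributes in feature space concrete, here is a minimal sketch of a forward pass through a small feed-forward network that maps a per-sequence feature vector to a score for one functional category. Everything specific in it is an assumption: the feature names are hypothetical stand-ins (the 14 actual inputs are not listed in this excerpt), the weights are random and untrained, and the single hidden layer is chosen only for brevity, not because it is the ProtFun architecture.

```python
import math
import random

# Hypothetical feature names standing in for predicted and calculated attributes.
FEATURES = ["n_glyc_sites", "phospho_sites", "signal_peptide", "frac_charged"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_network(n_in, n_hidden, seed=0):
    """Build random, untrained weights; a real model would be fitted to labelled data."""
    rng = random.Random(seed)
    # Each weight row starts with a bias term followed by one weight per input.
    hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return hidden, output


def predict(network, features):
    """Forward pass; the hidden layer is what makes the classification non-linear."""
    hidden, output = network
    h = [sigmoid(w[0] + sum(wi * x for wi, x in zip(w[1:], features))) for w in hidden]
    return sigmoid(output[0] + sum(wi * hi for wi, hi in zip(output[1:], h)))


net = make_network(len(FEATURES), n_hidden=3)
score = predict(net, [2.0, 5.0, 1.0, 0.25])  # made-up feature values
print(f"score for one hypothetical category: {score:.3f}")
```

Because the input is a fixed-length vector of attributes rather than an aligned sequence, two proteins with no detectable sequence similarity can still receive similar scores if their attributes are similar.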
Conclusion
The method presented here has the ability to transfer functional information between sequences that are far apart in sequence space. Not even the primary structures of the individual features (which are integrated by the method) need to be alike, or be related by evolution. The ProtFun method performs its non-linear classification in the feature space defined by 14 predicted and calculated attributes, which have been selected by the approach (out of more than 25 different attributes considered
Data sets and functional class assignment
Classes of cellular function were defined after the 14-class classification originally proposed for the E. coli genome14 and later extended by the TIGR group. The automatic class assignment to sequences was made by an extension of the Euclid system performing linguistic analysis of SWISS-PROT keywords.25 The system detects sequences similar to a query sequence by a BLAST search in the SWISS-PROT database and extracts common keywords from the entries. As we work with sequences from SWISS-PROT
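The keyword step described above can be sketched roughly as follows, assuming the similar entries have already been retrieved (the BLAST search itself is not shown). The voting threshold, the keyword-to-class table, and the example hits are all hypothetical; the actual rules of the Euclid system are not given in this excerpt.

```python
from collections import Counter


def common_keywords(hit_keywords, min_fraction=0.5):
    """Return keywords present in at least min_fraction of the similar entries."""
    if not hit_keywords:
        return []
    # Count each keyword once per entry, however often it is listed there.
    counts = Counter(kw for kws in hit_keywords for kw in set(kws))
    threshold = min_fraction * len(hit_keywords)
    return sorted(kw for kw, n in counts.items() if n >= threshold)


# Hypothetical mapping from keyword to functional class.
KEYWORD_TO_CLASS = {
    "Glycolysis": "Energy metabolism",
    "Transport": "Transport and binding",
}

# Made-up keyword sets for three entries similar to a query sequence.
hits = [
    {"Glycolysis", "Kinase", "Transferase"},
    {"Glycolysis", "Transferase"},
    {"Glycolysis", "Magnesium"},
]

shared = common_keywords(hits)
classes = sorted({KEYWORD_TO_CLASS[k] for k in shared if k in KEYWORD_TO_CLASS})
print(shared, classes)
```

Keywords that appear in only one hit are discarded, so the class label reflects what the similar entries have in common rather than any single annotation.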
References (35)
- et al. Functional discovery via a compendium of expression profiles. Cell (2000)
- et al. Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J. Mol. Biol. (1999)
- The regulation of protein function by multisite phosphorylation—a 25 year update. Trends Biochem. Sci. (2000)
- et al. PEST sequences and regulation by proteolysis. Trends Biochem. Sci. (1996)
- Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. (1999)
- Copper and prion disease. Brain Res. Bull. (2001)
- et al. Prediction of human mRNA donor and acceptor sites from the DNA sequence. J. Mol. Biol. (1991)
- Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput. Chem. (1994)
- et al. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. (2001)
- Initial sequencing and analysis of the human genome. Nature (2001)