Prediction of Human Protein Function from Post-translational Modifications and Localization Features

https://doi.org/10.1016/S0022-2836(02)00379-0

Abstract

We have developed an entirely sequence-based method that identifies and integrates relevant features that can be used to assign proteins of unknown function to functional classes and, for enzymes, to enzyme categories. We show that strategies for the elucidation of protein function may benefit from a number of functional attributes that are more directly related to the linear sequence of amino acids, and hence easier to predict, than protein structure. These attributes include features associated with post-translational modifications and protein sorting, but also much simpler aspects such as the length, isoelectric point and composition of the polypeptide chain.

Introduction

Out of the 35,000 to 50,000 genes believed to be present in the human genome, no more than 40–60% can be assigned a functional role based on homology to proteins with known function.1,2 Traditionally, protein function has been related directly to the three-dimensional structure of the polypeptide chain, which is currently hard to compute for an arbitrary sequence.3 The method presented here operates in the “feature” space of all sequences, and is therefore complementary to methods that are based on alignment and the inherent, position-by-position quantification of similarity between two sequences. The method does not require knowledge of gene expression,4 gene fusion, or phylogenetic profiles.5,6,7,8 Although the latter type of method does not rely on finding direct matches to proteins of known function, it does require sequence similarity to other candidates that can be phylogenetically linked to a protein of known function.

For any function assignment method, the ability to correctly predict the relationship depends strongly on the function classification scheme used. One would, for example, not expect a method based on co-regulation to work well for a category like “enzyme”, since enzymes and the genes coding for their substrates or substrate transporters may often display strong co-regulation. A similar argument holds true for the phylogenetic profile method.

Our approach to function prediction is based on the fact that a protein is not alone when performing its biological task. It has to operate using the same cellular machinery for modification and sorting as all the other proteins do. Essential types of post-translational modifications (PTMs) include N- and O-glycosylation, (S/T/Y) phosphorylation, and cleavage of N-terminal signal peptides controlling the entry to the secretory pathway, but hundreds of other types of modifications exist,9 only a subset of which will be present in any given organism. Many of the PTMs are enabled by local consensus sequence motifs, while others are characterized by more complex patterns of correlation between the amino acids.10
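To make the notion of a local consensus motif concrete, the sketch below scans a sequence for the widely used N-glycosylation sequon N-X-[S/T], where X is any residue except proline. The sequon pattern is general biochemical knowledge rather than something stated in this text, the function name is hypothetical, and a plain motif scan of this kind is only a rough stand-in for the trained PTM predictors the authors refer to.

```python
import re

# Assumed consensus: Asn, any residue except Pro, then Ser or Thr.
# The lookahead makes the match zero-width so overlapping sequons are all found.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_n_glyc_sequons(sequence: str) -> list[int]:
    """Return 1-based positions of Asn residues that start an N-X-[S/T] sequon."""
    seq = sequence.upper()
    return [match.start() + 1 for match in SEQUON.finditer(seq)]


if __name__ == "__main__":
    # Toy sequence, not a real protein.
    print(find_n_glyc_sequons("MNGSANPTNNTS"))
```

A match only marks a candidate site; whether the site is actually modified depends on context that a motif scan cannot capture, which is one reason such features are predicted with more elaborate models.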

This suggests an alternative approach for function prediction, as one may expect that proteins performing similar functions would share some attributes even though they are not at all related at the global level of primary structure. As several predictive methods for PTMs have been constructed (R.G., S.B., unpublished results),10,11,12,13 a function prediction method based on such attributes can be applied to all proteins where the sequence is known.
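As an illustration of attributes that can be computed for any protein of known sequence, the following sketch derives the length, the amino acid composition, and a crude charge proxy. The feature names and the charge proxy (K plus R minus D minus E) are assumptions made for this example; the proxy is not an isoelectric point calculation, and this is not the feature set used by the authors.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def simple_sequence_features(sequence: str) -> dict[str, float]:
    """Compute length, per-residue fractions and a crude net-charge proxy."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    features = {"length": float(len(seq))}
    for aa in AMINO_ACIDS:
        # Fraction of the chain made up of this residue type.
        features[f"frac_{aa}"] = counts[aa] / len(seq)
    # Hypothetical proxy: basic residues minus acidic residues (ignores His and termini).
    features["net_charge_proxy"] = float(
        counts["K"] + counts["R"] - counts["D"] - counts["E"]
    )
    return features


if __name__ == "__main__":
    feats = simple_sequence_features("MKKDE")  # toy sequence
    print(feats["length"], feats["frac_K"], feats["net_charge_proxy"])
```

Because none of these values depends on an alignment, two proteins with no detectable sequence similarity can still end up close to each other in this kind of feature space.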

Results and Discussion

The ProtFun method described here integrates (using a neural network approach) 14 individual attribute predictions and calculated sequence statistics (out of more than 25 tested for discriminative value). The integrated method predicts functional categories following the scheme originally defined by Riley for Escherichia coli, which in modified form has been used to describe many entire genomes in recent publications.1,2,14,15 In addition, it predicts whether a sequence is likely to function as an enzyme,

Conclusion

The method presented here has the ability to transfer functional information between sequences that are far apart in sequence space. Not even the primary structures of the individual features (which are integrated by the method) need to be alike, or be related by evolution. The ProtFun method performs its non-linear classification in the feature space defined by 14 predicted and calculated attributes, which have been selected by the approach (out of more than 25 different attributes considered
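The idea of a non-linear classification in a 14-dimensional feature space can be sketched as a small feed-forward network that maps a feature vector to class probabilities. Only the number of inputs (14) comes from the text; the hidden layer size, the number of classes, the random untrained weights, and all function names are assumptions for illustration, and this is not the authors' architecture or training procedure.

```python
import math
import random


def softmax(scores: list[float]) -> list[float]:
    """Convert raw scores to probabilities, shifting by the max for stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(features, w_hidden, b_hidden, w_out, b_out) -> list[float]:
    """One tanh hidden layer followed by a softmax output layer."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    scores = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_out, b_out)
    ]
    return softmax(scores)


def random_layer(n_out: int, n_in: int, rng: random.Random) -> list[list[float]]:
    """Hypothetical untrained weights; a real model would learn these from data."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)
N_FEATURES, N_HIDDEN, N_CLASSES = 14, 5, 3  # 14 from the text; 5 and 3 are arbitrary
w_hidden = random_layer(N_HIDDEN, N_FEATURES, rng)
w_out = random_layer(N_CLASSES, N_HIDDEN, rng)
b_hidden = [0.0] * N_HIDDEN
b_out = [0.0] * N_CLASSES

example_features = [0.1 * i for i in range(N_FEATURES)]  # placeholder feature vector
probs = forward(example_features, w_hidden, b_hidden, w_out, b_out)
```

The tanh hidden layer is what makes the decision boundary non-linear; with the hidden layer removed, the model would reduce to a linear classifier over the same features.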

Data sets and functional class assignment

Classes of cellular function were defined after the 14-class classification originally proposed for the E. coli genome14 and later extended by the TIGR group. The automatic class assignment to sequences was made by an extension of the Euclid system, which performs linguistic analysis of SWISS-PROT keywords.25 The system detects sequences similar to a query sequence by a BLAST search in the SWISS-PROT database and extracts common keywords from the entries. As we work with sequences from SWISS-PROT
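The keyword extraction step described above can be sketched as counting how many hit entries carry each keyword and keeping those shared by a sufficient fraction of hits. The function name, the fraction threshold, and the toy keyword sets are assumptions for illustration; the actual Euclid system and its linguistic analysis are not reproduced here, and the BLAST search itself is left out.

```python
from collections import Counter


def common_keywords(hit_keywords: list[set[str]], min_fraction: float = 0.5) -> list[str]:
    """Return keywords present in at least `min_fraction` of the hit entries."""
    if not hit_keywords:
        return []
    # Count each keyword once per entry, regardless of repeats within an entry.
    counts = Counter(kw for keywords in hit_keywords for kw in set(keywords))
    threshold = min_fraction * len(hit_keywords)
    return sorted(kw for kw, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Toy keyword sets standing in for the entries returned by a similarity search.
    hits = [
        {"Kinase", "ATP-binding"},
        {"Kinase", "Transferase"},
        {"Kinase", "ATP-binding", "Membrane"},
    ]
    print(common_keywords(hits, min_fraction=0.6))
```

In a full pipeline, the retained keywords would then be mapped to one of the functional classes; that mapping is specific to the classification scheme and is not shown.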

References (35)

  • J.C. Venter et al. The sequence of the human genome. Science (2001)
  • Lesk, A., Conte, L., Hubbard, T. (2001). Assessment of novel fold targets in CASP4: predictions of three-dimensional...
  • M. Eisen et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA (1998)
  • E. Marcotte et al. A combined algorithm for genome-wide prediction of protein function. Nature (1999)
  • E. Marcotte et al. Detecting protein function and protein–protein interactions from genome sequences. Science (1999)
  • M. Pellegrini et al. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA (1999)
  • J.S. Garavelli et al. The RESID database of protein structure modifications and the nrl-3d sequence-structure database. Nucl. Acids Res. (2001)

    These two authors contributed equally to this work.

    http://www.cbs.dtu.dk
