Improved Prediction of Signal Peptides: SignalP 3.0

https://doi.org/10.1016/j.jmb.2004.05.028

Abstract

We describe improvements to SignalP, currently the most popular method for prediction of classically secreted proteins. SignalP consists of two different predictors, based on neural network and hidden Markov model algorithms, and both components have been updated. Motivated by the idea that the cleavage site position and the amino acid composition of the signal peptide are correlated, new features have been included as input to the neural network. This addition, combined with a thorough error-correction of a new data set, has improved the performance of the predictor significantly over SignalP version 2. In version 3, correctness of the cleavage site predictions has increased notably for all three organism groups: eukaryotes, Gram-negative bacteria and Gram-positive bacteria. The accuracy of cleavage site prediction has increased in the range 6–17% over the previous version, whereas the signal peptide discrimination improvement is mainly due to the elimination of false-positive predictions, as well as the introduction of a new discrimination score for the neural network. The new method has been benchmarked against other available methods. Predictions can be made at the publicly available web server http://www.cbs.dtu.dk/services/SignalP/

Introduction

Numerous methods for predicting the subcellular location of proteins using machine learning techniques have been developed.1, 2, 3, 4, 5, 6, 7, 8, 9 Computational methods for prediction of N-terminal signal peptides were published around 20 years ago, initially using a weight matrix approach.1, 2 Development of prediction methods shifted to machine learning algorithms in the mid 1990s,10, 11 with a significant increase in performance.12 SignalP, currently one of the most widely used methods, predicts the presence of signal peptidase I cleavage sites. For signal peptidase II cleavage sites found in lipoproteins, the LipoP predictor has been constructed.13 SignalP produces both classification and cleavage site assignment, while most of the other methods only classify proteins as secretory or non-secretory.

A consistent assessment of predictive performance requires a reliable benchmark dataset. This is particularly important in this area, where the predictive performance is approaching the performance calculated from interpretation of experimental data, which is not always perfect. Incorrect annotation of signal peptide cleavage sites in the databases stems from trivial database errors, and from peptide sequencing, where it may be hard to control the level of post-processing of the protein by other peptidases after signal peptidase I has made its initial cleavage. Such post-processing typically leads to cleavage site assignments shifted downstream relative to the true signal peptidase I cleavage site.

In the process of training the new version of SignalP we have generated a new, thoroughly curated dataset based on the extraction and redundancy reduction method published earlier.14 Other methods were used for cleaning the new dataset, and we found a surprisingly high error rate in Swiss-Prot, where, for example, of the order of 7% of the Gram-positive entries had a wrong cleavage site position, a wrong annotation of the experimental evidence, or both. We also found many errors in a previously used benchmark set (stemming from automatic extraction from Swiss-Prot),12 and it appears that some programs in fact perform better than reported (their predictions are correct, while the feature annotation is incorrect). For comparison, we made use of this independent benchmark dataset, which was used initially for evaluation of five different signal peptide predictors.12

In the new version of SignalP we have introduced novel amino acid composition units as well as sequence position units in the neural network input layer in order to obtain better performance. Moreover, we have changed the window sizes slightly compared to the previous version. We have used fivefold cross-validation tests for direct comparison to the previous version of SignalP.10 In the previous version of SignalP a combination score, Y, was created from the cleavage site score, C, and the signal peptide score, S, and used to obtain a better prediction of the position of the cleavage site. In the new version, we also use the C-score to obtain a better discrimination between secreted and non-secreted sequences, and have constructed a new D-score for this classification task. The architecture of the hidden Markov model (HMM) SignalP has not changed, but the models have been retrained on the new data set, and have increased their performance significantly.
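As a rough illustration of how per-position scores of this kind can be combined, the sketch below computes a Y-score as the geometric mean of the C-score and the local drop in the S-score, and a D-score as the average of the mean S-score over the predicted signal peptide and the maximal Y-score. These formulas, the window parameter d and the function names are assumptions made for illustration; this section does not state the exact definitions used in SignalP.

```python
import math

def y_scores(c, s, d=3):
    """Combine per-position C- and S-scores into Y-scores.

    Assumed form: Y_i = sqrt(C_i * drop_i), where drop_i is the mean S-score
    of the d positions before i minus the mean of the d positions from i on.
    Position i is taken to be the first residue of the mature protein.
    """
    y = []
    for i in range(len(c)):
        before = s[max(0, i - d):i]
        after = s[i:i + d]
        if not before or not after:
            y.append(0.0)
            continue
        drop = sum(before) / len(before) - sum(after) / len(after)
        # A cleavage site needs both a C-score peak and a falling S-score.
        y.append(math.sqrt(c[i] * drop) if drop > 0 else 0.0)
    return y

def d_score(c, s, d=3):
    """Return (predicted cleavage position, D-score).

    Assumed form: D = (mean S-score over the predicted signal peptide
    + maximal Y-score) / 2.
    """
    y = y_scores(c, s, d)
    cleavage = max(range(len(y)), key=y.__getitem__)
    # With no residues before the predicted site there is no signal peptide.
    s_mean = sum(s[:cleavage]) / cleavage if cleavage > 0 else 0.0
    return cleavage, (s_mean + y[cleavage]) / 2

# Toy input: S-score is high for four residues, C-score peaks right after them.
c = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
s = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
position, score = d_score(c, s, d=2)
```

A sequence would then be classified as secreted when the D-score exceeds a chosen threshold; the threshold value is not given here and would have to be tuned on held-out data.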

Generation of data sets

As the predictive performance of the earlier SignalP method was quite high, assessment of potential improvements is critically dependent on the quality of the data annotation. We generated a new positive signal peptide data set from Swiss-Prot15 release 40.0, retaining the negative dataset extracted from the previous work. The method for redundancy reduction was the same as in the previous work,14 and was based on the reduction principle developed by Hobohm et al.16 Our final positive signal

Conclusion

We present new versions of SignalP, based on an expanded, highly curated dataset. The architecture of the HMM-based version was unchanged, while the neural network scheme was improved by including information about the amino acid composition of the precursor protein as well as the position of the sliding window. Furthermore, we optimized the window sizes by testing all possible combinations of asymmetric and symmetric input windows up to a total input of 51 amino acid residues. These were

Data set extraction

All sequence data were extracted from Swiss-Prot15 release 40.0. A total of 12,975 entries with the keyword SIGNAL were found. The dataset was split into three species-specific groups: eukaryotes, Gram-negative prokaryotes and Gram-positive prokaryotes. We excluded all archaeal sequences. Non-experimentally verified signal peptides that had POTENTIAL or HYPOTHETICAL stated in the keyword line were removed. Furthermore, any phage, viral or eukaryote organelle-encoded proteins were excluded.
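The filtering steps described above can be sketched as follows. The entry dictionaries and their field names (`group`, `organelle_encoded`, `signal_annotation`) are hypothetical stand-ins for already parsed Swiss-Prot records, not a real Swiss-Prot parser or file format, and the function name is likewise an assumption.

```python
# Organism groups that are kept; anything else (archaea, phage, viruses)
# is dropped by the group check below.
GROUPS = ("eukaryote", "gram_negative", "gram_positive")

# Qualifiers marking signal peptides that are not experimentally verified.
UNVERIFIED = ("POTENTIAL", "HYPOTHETICAL")

def select_entries(entries):
    """Split entries with a verified SIGNAL annotation into organism groups."""
    selected = {group: [] for group in GROUPS}
    for entry in entries:
        if entry.get("group") not in selected:
            continue  # archaeal, phage or viral sequence
        if entry.get("organelle_encoded", False):
            continue  # eukaryote organelle-encoded protein
        annotation = entry.get("signal_annotation", "").upper()
        if not annotation.startswith("SIGNAL"):
            continue  # no signal peptide annotated
        if any(tag in annotation for tag in UNVERIFIED):
            continue  # not experimentally verified
        selected[entry["group"]].append(entry)
    return selected

# Toy records illustrating each exclusion rule.
entries = [
    {"id": "A", "group": "eukaryote", "signal_annotation": "SIGNAL 1 22"},
    {"id": "B", "group": "gram_positive", "signal_annotation": "SIGNAL 1 30 POTENTIAL"},
    {"id": "C", "group": "archaea", "signal_annotation": "SIGNAL 1 25"},
    {"id": "D", "group": "eukaryote", "organelle_encoded": True,
     "signal_annotation": "SIGNAL 1 20"},
    {"id": "E", "group": "gram_negative", "signal_annotation": "SIGNAL 1 21"},
    {"id": "F", "group": "eukaryote", "signal_annotation": ""},
]
selected = select_entries(entries)
```

Keeping the three groups in separate lists mirrors the species-specific split in the text, so that each group can later be redundancy-reduced and used for training on its own.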

Acknowledgements

We thank Anders Krogh for use of his HMM software. This work was supported by grants from the Danish National Research Foundation, the Danish Natural Science Research Council, the Danish Center for Scientific Computing, and by a grant from Novozymes A/S (to J.D.B.).

References (44)

  • A. Reinhardt et al. Using neural networks for prediction of the subcellular location of proteins. Nucl. Acids Res. (1998)
  • S. Hua et al. Support vector machine approach for protein subcellular localization prediction. Bioinformatics (2001)
  • J.P. Vert. Proceedings of the Pacific Symposium on Biocomputing (2002)
  • P. Fariselli et al. SPEPlip: the detection of signal peptide and lipoprotein cleavage sites. Bioinformatics (2003)
  • J.L. Gardy et al. PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. Nucl. Acids Res. (2003)
  • Z. Zhang et al. A profile hidden Markov model for signal peptides generated by HMMER. Bioinformatics (2003)
  • H. Nielsen et al. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. (1997)
  • H. Nielsen et al. Prediction of signal peptides and signal anchors by a hidden Markov model. Proc. Int. Cong. Intell. Syst. Mol. Biol. (1998)
  • K.M. Menne et al. A comparison of signal sequence prediction methods using a test set of signal peptides. Bioinformatics (2000)
  • A.S. Juncker et al. Prediction of lipoprotein signal peptides in Gram-negative bacteria. Protein Sci. (2003)
  • H. Nielsen et al. Defining a similarity threshold for a functional protein sequence pattern: the signal peptide cleavage site. Proteins: Struct. Funct. Genet. (1996)
  • A. Bairoch et al. The Swiss-Prot protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res. (2000)