Elsevier

Computational Biology and Chemistry

Volume 65, December 2016, Pages 117-127

Research article
SVM and SVR-based MHC-binding prediction using a mathematical presentation of peptide sequences

https://doi.org/10.1016/j.compbiolchem.2016.10.011

Highlights

  • A new approach for peptide representation is proposed.

  • Different models for quantitative and qualitative MHC binding prediction are developed.

  • The accuracy of the developed models is comparable to, or better than, that of the best existing methods.

Abstract

Many methods exist for predicting T-cell epitopes and major histocompatibility complex (MHC)-binding peptides, yet limitations remain that affect their reliability. For this reason, the development of highly accurate models is crucial. Accurate prediction of the peptides that bind to specific major histocompatibility complex class I and II (MHC-I and MHC-II) molecules is important for understanding how the immune system functions and for developing peptide-based vaccines. Peptide binding is the most selective step in identifying T-cell epitopes. In this paper, we present a new approach to predicting MHC-binding ligands that takes into account new weighting schemes for position-based amino acid frequencies, BLOSUM and VOGG substitution of amino acids, and the physicochemical and molecular properties of amino acids. We developed models for both quantitative and qualitative prediction of MHC-binding ligands. The models are based on two machine learning methods, support vector machine (SVM) and support vector regression (SVR), and use several different peptide encoding and weighting schemes for feature selection. The resulting models showed comparable, and in some cases better, performance than the best existing predictors. The results indicate that the physicochemical and molecular properties of amino acids (AA) contribute significantly to peptide-binding affinity.

Introduction

The binding of peptides, derived by the intracellular processing of protein antigen(s) (Ag(s)), to MHC proteins is the most selective step in defining T-cell epitopes. Computational methods based on reverse immunology are an essential step in the identification of T-cell epitope candidates, and complement epitope screening by predicting the best binding peptides. Computational epitope-prediction programs are trained on the known peptide-binding affinities to a particular MHC molecule (or a defined set of MHC molecules) and fall into two categories (Brusic et al., 2004, Yang and Yu, 2009): sequence-based and structure-based. The first category focuses on the primary structure of the analyzed protein Ags and the identification of binding peptides, while the second makes use of the 3D structure of MHC molecules or their binding sites. Sequence-based methods for the prediction of MHC-binding peptides include binding motifs, quantitative matrices, artificial neural networks, hidden Markov models (HMMs) and molecular modeling. Structure-based methods, developed in structural biology for the prediction of potentially good MHC-binders, involve docking of peptides, threading algorithms, binding energy and molecular dynamics to discriminate between binding and non-binding peptides (Patronov and Doytchinova, 2013). In this study, we have considered only sequence-based methods. Of the current sequence-based methods, the most prevalent are those based on machine learning, because they offer a better balance between cost and performance. Because experimental methods are very expensive and time consuming, prediction methods that reduce the number of experiments can significantly lower the time and money needed (Huang et al., 2013). Prediction methods are also used when the time needed to identify epitopes is crucial for rapid immunization.

The first methods developed for predicting epitopes included only the analysis of sequences and epitope motif alignment (Sette and Fikes, 2003). These methods were later upgraded to position-specific scoring matrix (PSSM) approaches (Yu et al., 2002). A disadvantage of these approaches is that they neglect the link between an amino acid (AA) and its neighbors, i.e. they assume that an AA appears independently at a given position and contributes alone to the binding affinity. Another flaw of these methods is their lower accuracy in predicting T-cell epitopes, which was one of the motives for using the advanced machine learning algorithms mentioned above (Yu et al., 2002).
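The independence assumption described above can be made concrete with a minimal sketch of a PSSM-style scorer. This is an illustration only, not the method of any cited predictor: the function names, the uniform background frequency, the pseudocount, and the toy peptides are all assumptions made for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(binders, pseudocount=1.0):
    """Build a log-odds matrix from equal-length binder peptides.

    Assumption: a uniform background frequency (1/20) is used for
    simplicity; real PSSM methods typically use observed backgrounds.
    """
    length = len(binders[0])
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        # Start every residue at the pseudocount to avoid log(0).
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for peptide in binders:
            counts[peptide[pos]] += 1.0
        total = sum(counts.values())
        pssm.append(
            {aa: math.log((counts[aa] / total) / background) for aa in AMINO_ACIDS}
        )
    return pssm


def score_peptide(pssm, peptide):
    """Sum per-position scores; each residue contributes independently."""
    if len(peptide) != len(pssm):
        raise ValueError("peptide length must match the matrix length")
    return sum(column[aa] for column, aa in zip(pssm, peptide))


if __name__ == "__main__":
    # Toy nonamers invented for illustration, not measured binders.
    toy_binders = ["KLAAAAAAV", "ALAAAAAAV", "KLGGGGGGL"]
    matrix = build_pssm(toy_binders)
    print(score_peptide(matrix, "KLAAAAAAV"))
    print(score_peptide(matrix, "DDDDDDDDD"))
```

Because the score is a plain sum over positions, no term can capture an interaction between two residues, which is the limitation the paragraph above points out.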

Novel and more advanced models for predicting T-cell epitopes with high accuracy appear frequently. However, when we tried to include all these predictors in our previous research (Mitić et al., 2014, Pavlović et al., 2014, Jandrlić et al., 2016, Jandrlić, 2016), we were confronted with numerous restrictions. Predictions were made for a single allele or a small number of alleles, and models were trained on at most 200–300 peptides. In the absence of experimental data, data produced through predictions were combined, or taken from different sources, irrespective of the methods used to define the epitopes (Peters et al., 2006). The development of a good model for the prediction of T-cell epitopes is a difficult task because of this lack of well-documented experimental data. It is common to use data from published articles or from specialized databases providing information related to T-cell epitopes, such as Syfpeithi (Rammensee et al., 1999), MHCBN (Bhasin et al., 2003), AntiJen (Toseland et al., 2005), HLA Ligand (Sathiamurthy et al., 2003) and FIMM (Schönbach et al., 2000). However, algorithm developers are not always aware of the implications of mixing data from different experimental approaches, such as T-cell response, MHC ligand elution and MHC binding data. Even within a single assay category, such as MHC-binding experiments, mixing data from different sources without further standardization can be problematic. The data often contained peptides classified as both binding and nonbinding (Peters et al., 2006).

This section describes the currently most reliable predictors of MHC-binding peptides and their methodology. Predictors from the CBS group (http://www.cbs.dtu.dk/) have proved to be the most reliable and accurate, and they support a large number of different HLA alleles, even in the absence of experimental data (pan-specific). These predictors belong to two families, NetMHC and NetMHCpan, both of which are based on artificial neural networks (ANNs). They combine several ANN models, which are based on sparse sequence encoding and BLOSUM50 encoding. ANNs can be used to make both qualitative and quantitative predictions. These predictors are still being developed and improved, and are included in a number of tools, including the Immune Epitope Database (IEDB, http://tools.iedb.org/main/tcell/) and CBS (Luo et al., 2015). The next group of predictors, which is also regularly updated and accessible for large-scale analysis, is that from the IEDB (http://tools.immuneepitope.org/mhci/). Most of the predictors available at the IEDB are ANN-based or matrix-based. ANN predictors (Nielsen et al., 2003) use a combination of sparse encoding, BLOSUM encoding and input derived from hidden Markov models as a sequence presentation for different neural networks; these models are then combined. The ARB (average relative binding) predictor (Bui et al., 2005) is a matrix-based method that directly predicts IC50 values; SMM and SMMPMBEC are also matrix-based methods that predict peptide binding to MHC molecules, peptide transport by the transporter associated with antigen processing (TAP) and proteasomal cleavage of proteins (Peters and Sette, 2005). PickPocket is based on receptor pocket similarities (Kim et al., 2009). There are also predictors that are not mentioned here; a detailed list of predictors with their prediction precision is given in Luo et al. (2015).

Most of the predictors are based on sparse or BLOSUM50 encoding of sequences, while different machine learning methods are used: ANN, HMM, decision tree and SVM. These predictors are not described in detail here because they are not available as stand-alone applications and cannot easily be tested on larger proteins or on multiple proteins. In addition, even when only one protein is considered, the result is obtained only after some time, or the end result is an error message. Of all the predictors presented in the abovementioned work, only POPI (Tung and Ho, 2007) uses physicochemical (PC) properties as input features. However, this predictor gives only a qualitative prediction (a peptide is or is not an epitope), with very low reported accuracy. Besides the predictors mentioned, there are proposed methods based on orthonormal encoding strategies and binary-encoded PC properties, which suggest that certain combinations of PC properties can significantly improve classification performance (Gok and Ozcerit, 2012a, Gok and Ozcerit, 2012b). However, these methods were developed only for qualitative classification of peptides.
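As a minimal sketch of the sparse sequence encoding mentioned above, each residue can be mapped to a 20-dimensional indicator block, so that a nonamer becomes a 180-dimensional binary vector. The function name and the alphabet ordering are assumptions made for this illustration; they are not taken from any of the cited predictors.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def sparse_encode(peptide):
    """Return a flat 0/1 list with one 20-element block per residue."""
    vector = []
    for aa in peptide:
        if aa not in AA_INDEX:
            raise ValueError(f"unknown amino acid: {aa!r}")
        block = [0] * len(AMINO_ACIDS)
        block[AA_INDEX[aa]] = 1  # exactly one position is set per residue
        vector.extend(block)
    return vector


if __name__ == "__main__":
    encoded = sparse_encode("ACDEFGHIK")
    print(len(encoded), sum(encoded))
```

A BLOSUM-style encoding follows the same layout but would replace each indicator block with the corresponding row of a substitution matrix, so that similar residues receive similar vectors.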

In our previous work (Mitić et al., 2014, Pavlović et al., 2014, Jandrlić et al., 2016), we used predictors from the CBS group; however, our attempt to include predictors from outside the CBS group and the IEDB tools, in order to compare their results with other protein characteristics, ran into all of the abovementioned problems. This was the reason for creating the support vector machine (SVM) models used for binary classification and regression, based on different sequence encoding strategies. The obtained models predict MHC-binding ligands with high accuracy. Of particular interest was the finding that the combination of PC properties greatly influences the binding affinity of peptides. To avoid problems of inconsistent data, such as differing measures of binding affinity, and to obtain reliable models, we chose to use only data from the IEDB (http://www.iedb.org/), which is regularly updated, as the most reliable source of MHC-binding ligands.

Datasets

The data source was the Immune Epitope Database (IEDB), June 2015 version. All experimentally proven MHC-binding ligands for all available alleles were downloaded. We limited our research to peptides 9 amino acids (AAs) in length because nonamers are the most common MHC-I epitopes, and because for MHC-I there exist enough experimental data for the construction of good models for a number of alleles. We discarded the data:

  • where there was insufficient information for the construction of a good

Experiments

For each encoding scheme described, two models were created. One is a binary classification model (predicting whether or not a peptide is an epitope), and the other is a regression model (predicting the binding affinity of a peptide to a particular allele). Affinity is usually expressed as IC50 values in the range [0–50000]; here the affinity is scaled and represented as 1 − log_50k(affinity), where log_50k denotes the logarithm to base 50,000. The scaled value of affinity is in the range [0–1]. The experiments, for all models, were
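The affinity scaling described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, IC50 values are assumed to be in nM, and the clipping to [0, 1] is an assumption added so that values below 1 or above 50,000 stay inside the stated range.

```python
import math

MAX_IC50 = 50000.0  # upper bound of the IC50 range given in the text


def scale_affinity(ic50):
    """Map an IC50 value to 1 - log_50000(IC50), clipped to [0, 1]."""
    if ic50 <= 0:
        raise ValueError("IC50 must be positive for the logarithm to exist")
    scaled = 1.0 - math.log(ic50) / math.log(MAX_IC50)
    # Clipping is an assumption of this sketch, not stated in the source.
    return min(1.0, max(0.0, scaled))


def unscale_affinity(scaled):
    """Invert the transformation for values inside [0, 1]."""
    return MAX_IC50 ** (1.0 - scaled)


if __name__ == "__main__":
    for ic50 in (1.0, 500.0, 50000.0):
        print(ic50, scale_affinity(ic50))
```

Under this transformation strong binders (low IC50) map to values near 1 and weak binders (IC50 near 50,000) map to values near 0, which gives the regression model a bounded target.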

Conclusion

In this study, we present a position-dependent method to predict peptide binding to MHC class I proteins. Models were made to predict MHC class I-binding ligands for 15 different alleles, and were based on a mathematical representation of a peptide in the form of a vector. Vector components were obtained based on the physicochemical and molecular properties of the amino acids contained in the peptides, weighting schemes used in text classification, but never before used for this type of

Acknowledgement

The work presented has been financially supported by the Ministry of Education, Science and Technological Development, Republic of Serbia, Project No. 174002.

References (38)

  • R. Courant, D. Hilbert, 1954. Methods of Mathematical Physics, Vol. I. New York and London (Interscience Publishers)....
  • M. Gok et al., 2012. OETMAP: a new feature encoding scheme for MHC class I binding prediction. Mol. Cell. Biochem.
  • P. Guan et al., 2005. Analysis of peptide-protein binding using amino acid descriptors: prediction and experimental verification for human histocompatibility complex HLA-A0201. J. Med. Chem.
  • J.A. Hartigan, 1975. Clustering Algorithms.
  • S. Henikoff et al., 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U. S. A.
  • D.R. Jandrlić et al., 2016. Software tools for simultaneous data visualization and T cell epitopes and disorder prediction in proteins. J. Biomed. Inform.
  • D. Jandrlić, 2016. The rule based classification models for MHC binding prediction and identification of the most relevant physicochemical properties for the individual allele. Univ. Thought – Publ. Nat. Sci.
  • T. Joachims, 1997. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. International Conference on Machine Learning (ICML).
  • T. Joachims, 2005. Text categorization with Support Vector Machines: learning with many relevant features.