Research article
SVM and SVR-based MHC-binding prediction using a mathematical presentation of peptide sequences
Introduction
The binding of peptides, derived by the intracellular processing of protein antigen(s) (Ag(s)), to MHC proteins is the most selective step in defining T-cell epitopes. Computational methods based on reverse immunology are an essential step in the identification of T-cell epitope candidates, and they complement epitope screening by predicting the best binding peptides. Computational epitope-prediction programs are trained on the known peptide-binding affinities to a particular MHC molecule (or a defined set of MHC molecules) and fall into two categories (Brusic et al., 2004, Yang and Yu, 2009): sequence-based and structure-based. The first category focuses on the primary structure of the analyzed protein Ags and the identification of binding peptides, while the second makes use of the 3D structure of MHC molecules or their binding sites. Sequence-based methods for the prediction of MHC-binding peptides include binding motifs, quantitative matrices, artificial neural networks, hidden Markov models (HMMs) and molecular modeling. Structure-based methods, developed in structural biology for the prediction of potentially good MHC-binders, use docking of peptides, threading algorithms, binding energy and molecular dynamics to discriminate between binding and non-binding peptides (Patronov and Doytchinova, 2013). In this study, we have considered only sequence-based methods. Of the current sequence-based methods, the most prevalent are those based on machine learning, because they offer a better balance between cost and performance. Because experimental methods are very expensive and time consuming, prediction methods that reduce the number of experiments can significantly lower the time and money needed (Huang et al., 2013). Prediction methods are also used when the time needed for the identification of epitopes is crucial for rapid immunization.
The first methods developed for predicting epitopes relied only on sequence analysis and epitope motif alignment (Sette and Fikes, 2003). These methods were later upgraded to position-specific scoring matrix (PSSM) approaches (Yu et al., 2002). A disadvantage of these approaches is that they ignore the link between an amino acid (AA) and its neighboring molecules, i.e. they assume that each AA appears at its position independently and contributes on its own to the binding affinity. Another flaw of these methods is their poorer accuracy in predicting T-cell epitopes, which was one of the motives for using the advanced machine learning algorithms mentioned above (Yu et al., 2002).
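The independence assumption behind PSSM scoring can be illustrated with a minimal sketch. The matrix below is a toy example for a three-residue peptide over three amino acids; all scores are invented for illustration and do not come from any published matrix, and the names `TOY_PSSM` and `pssm_score` are hypothetical.

```python
# Toy position-specific scoring matrix: one dict of per-residue scores
# for each peptide position. The numbers are invented for illustration.
TOY_PSSM = [
    {"A": 0.5, "L": 1.2, "K": -0.3},  # position 1
    {"A": -0.1, "L": 2.0, "K": 0.0},  # position 2
    {"A": 0.3, "L": 0.8, "K": 1.5},   # position 3
]


def pssm_score(peptide, pssm):
    """Sum the per-position scores of a peptide.

    Each residue contributes independently of its neighbours, which is
    exactly the simplifying assumption discussed in the text.
    """
    if len(peptide) != len(pssm):
        raise ValueError("peptide length must match the number of PSSM columns")
    return sum(column[aa] for aa, column in zip(peptide, pssm))


# Replacing A with L at position 1 changes the score by the same amount
# whatever residues occupy the other positions.
delta_in_context_1 = pssm_score("LLK", TOY_PSSM) - pssm_score("ALK", TOY_PSSM)
delta_in_context_2 = pssm_score("LAA", TOY_PSSM) - pssm_score("AAA", TOY_PSSM)
```

Because the score is a plain sum over positions, the two deltas are equal: a substitution has a fixed effect that cannot depend on the neighboring residues.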
Novel and more advanced models for predicting T-cell epitopes with high accuracy appear frequently. However, when we tried to include all these predictors in our previous research (Mitić et al., 2014, Pavlović et al., 2014, Jandrlić et al., 2016, Jandrlić, 2016), we were confronted with numerous restrictions. Predictions were made for a single allele or a small number of alleles, and models were trained on at most 200–300 peptides. In the absence of experimental data, data produced through predictions were combined, or taken from different sources, irrespective of the methods used to define the epitopes (Peters et al., 2006). The development of a good model for the prediction of T-cell epitopes is a difficult task because of this lack of well-documented experimental data. It is common to use data from published articles or from specialized databases providing information related to T-cell epitopes, such as Syfpeithi (Rammensee et al., 1999), MHCBN (Bhasin et al., 2003), AntiJen (Toseland et al., 2005), HLA Ligand (Sathiamurthy et al., 2003) and FIMM (Schönbach et al., 2000). However, algorithm developers are not always aware of the implications of mixing data from different experimental approaches, such as T-cell response, MHC ligand elution and MHC binding data. Even within a single assay category, such as MHC-binding experiments, mixing data from different sources without further standardization can be problematic. The data often contained conflicting classifications of peptides as both binding and nonbinding (Peters et al., 2006).
This section describes the currently most reliable predictors of MHC-binding peptides and their methodology. Predictors from the CBS group (http://www.cbs.dtu.dk/) have proved to be the most reliable and accurate, and they support a large number of different HLA alleles, even in the absence of experimental data (pan-specific). These predictors belong to two families, NetMHC and NetMHCpan, both of which are based on artificial neural networks (ANNs). They combine several ANN models, which are based on sparse sequence encoding and BLOSUM50 encoding. ANNs can be used to make both qualitative and quantitative predictions. These predictors are still being developed and improved, and are included in a number of tools, including the Immune Epitope Database, IEDB (http://tools.iedb.org/main/tcell/) and CBS (Luo et al., 2015). The next group of predictors, which is also regularly updated and accessible for large-scale analysis, is that from the IEDB (http://tools.immuneepitope.org/mhci/). Most of the predictors available at the IEDB are ANN-based or matrix-based. ANN predictors (Nielsen et al., 2003) use a combination of sparse encoding, BLOSUM encoding and input derived from hidden Markov models as a sequence presentation for different neural networks; these models are then combined. The ARB (average relative binding) predictor (Bui et al., 2005) is a matrix-based method that directly predicts IC50 values; SMM and SMMPMBEC are also matrix-based methods that predict peptide binding to MHC molecules, peptide transport by the transporter associated with antigen processing (TAP) and proteasomal cleavage of proteins (Peters and Sette, 2005). PickPocket is based on receptor pocket similarities (Kim et al., 2009). There are also predictors that are not mentioned here; a detailed list of predictors with their prediction precision is given in Luo et al. (2015).
Most of the predictors are based on sparse or BLOSUM50 encoding of sequences, while different machine learning methods are used: ANN, HMM, Decision Tree and SVM. These predictors are not described in detail here because they are not available as stand-alone applications and cannot be easily tested on larger proteins or on multiple proteins. In addition, even when only one protein is considered, the result is obtained only after some time, or the end result is an error message. Of all the predictors presented in the abovementioned work, only POPI (Tung and Ho, 2007) uses physicochemical (PC) properties as input features. However, this predictor gives only a qualitative prediction (epitope or not an epitope), with very low reported accuracy. Besides the predictors mentioned, there are proposed methods based on orthonormal encoding strategies and binary encoded PC properties, which suggest that certain combinations of PC properties can significantly improve classification performance (Gok and Ozcerit, 2012a, Gok and Ozcerit, 2012b). However, the proposed methods were developed only for qualitative classification of peptides.
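As an illustration of the sparse encoding mentioned above, the following minimal sketch turns a peptide into a binary vector with one 20-element block per position. The alphabet ordering, the function name `sparse_encode` and the example peptide are assumptions made for this sketch; individual predictors may order the alphabet differently or use other on/off values.

```python
# The 20 standard amino acids in alphabetical one-letter order.
# The ordering is an arbitrary choice made for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sparse_encode(peptide):
    """Encode a peptide as a flat binary vector (one-hot per position).

    Each residue becomes a block of 20 values with a single 1 at the index
    of that residue, so a nonamer yields 9 * 20 = 180 components.
    """
    vector = []
    for aa in peptide:
        if aa not in AMINO_ACIDS:
            raise ValueError(f"unknown amino acid: {aa!r}")
        block = [0] * len(AMINO_ACIDS)
        block[AMINO_ACIDS.index(aa)] = 1
        vector.extend(block)
    return vector


# Arbitrary nonamer used only to show the shape of the output.
encoded = sparse_encode("ACDEFGHIK")
```

Such a vector records only residue identity at each position; it carries no information about the physicochemical similarity between amino acids, which is what BLOSUM-based and PC-property-based encodings add.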
In our previous work (Mitić et al., 2014, Pavlović et al., 2014, Jandrlić et al., 2016), we used predictors from the CBS group; however, our attempts to include other predictors that do not belong to the CBS group or the IEDB tools, in order to compare results with other protein characteristics, ran into all of the abovementioned problems. This was the reason for creating the support vector machine (SVM) models used for binary classification and regression, based on different sequence encoding strategies. The obtained models predict MHC-binding ligands with high accuracy. Of particular interest was the finding that the combination of PC properties greatly influences the binding affinity of peptides. To avoid the problems of inconsistent data, such as differing measures of binding affinity, and to obtain reliable models, we chose to use only data from the IEDB (http://www.iedb.org/), which is regularly updated, as the most reliable source of MHC-binding ligands.
Datasets
The data source was the Immune Epitope Database (IEDB), June 2015 version. All experimentally proven MHC-binding ligands for all available alleles were downloaded. We limited our research to peptides 9 amino acids (AAs) in length because nonamers are the most common MHC-I epitopes, and because for MHC-I there exist enough experimental data for the construction of good models for a number of alleles. We discarded the data:
- where there was insufficient information for the construction of a good
Experiments
For each of the described encoding schemes, two models were created. One model is for binary classification (prediction of whether a peptide is an epitope or not), and the other is a regression model (to predict the binding affinity of a peptide to a particular allele). Affinity is usually expressed as an IC50 value in the range [0–50000]; here the affinity is scaled and represented as 1 - log_50k(affinity), where log_50k is the logarithm to base 50000. The scaled value of affinity is in the range [0–1]. The experiments, for all models, were
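The affinity scaling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the clipping of results to [0, 1] is an assumption added here (IC50 values below 1 would otherwise map above 1), and an IC50 of exactly 0 is rejected because its logarithm is undefined.

```python
import math

MAX_IC50 = 50000.0  # upper end of the IC50 range given in the text


def scale_affinity(ic50):
    """Map an IC50 value to 1 - log_50000(IC50).

    Strong binders (low IC50) map close to 1, weak binders close to 0.
    Clipping to [0, 1] is an assumption of this sketch.
    """
    if ic50 <= 0:
        raise ValueError("IC50 must be positive for the logarithm to be defined")
    scaled = 1.0 - math.log(ic50) / math.log(MAX_IC50)
    return min(1.0, max(0.0, scaled))


def unscale_affinity(scaled):
    """Invert the transform: IC50 = 50000 ** (1 - scaled)."""
    return MAX_IC50 ** (1.0 - scaled)


weakest = scale_affinity(50000.0)   # IC50 at the top of the range
strongest = scale_affinity(1.0)     # IC50 of 1
round_trip = unscale_affinity(scale_affinity(500.0))
```

Because the transform is monotonically decreasing in IC50, a threshold on the scaled value corresponds directly to an IC50 threshold, so predicted scaled values can be converted back to IC50 with `unscale_affinity`.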
Conclusion
In this study, we present a position-dependent method to predict peptide binding to MHC class I proteins. Models were made to predict MHC class I-binding ligands for 15 different alleles, and were based on a mathematical representation of a peptide in the form of a vector. Vector components were obtained based on the physicochemical and molecular properties of the amino acids contained in the peptides, weighting schemes used in text classification, but never before used for this type of
Acknowledgement
The work presented has been financially supported by the Ministry of Education, Science and Technological Development, Republic of Serbia, Project No. 174002.
References (38)
- et al. Computational methods for prediction of T-cell epitopes-a framework for modelling, testing, and applications. Methods (2004)
- et al. Prediction of MHC class I binding peptides with a new feature encoding technique. Cell Immunol. (2012)
- et al. Using random forest to classify T-cell epitopes based on amino acid properties and molecular features. Anal. Chim. Acta (2013)
- et al. Epitope distribution in ordered and disordered protein regions. Part A: T-cell epitope frequency, affinity and hydropathy. J. Immunol. Methods (2014)
- et al. Epitope distribution in ordered and disordered protein regions. Part B: ordered regions and disordered binding sites are targets of T- and B-cell immunity. J. Immunol. Methods (2014)
- et al. Epitope-based vaccines: an update on epitope identification: vaccine design and delivery. Curr. Opin. Immunol. (2003)
- et al. An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J. Mol. Biol. (1995)
- et al. MHCBN: a comprehensive database of MHC binding and non-binding peptides. Bioinformatics (2003)
- et al. Automated generation and evaluation of specific MHC binding predictive tools: ARB matrix applications. Immunogenetics (2005)
- et al. Support-vector networks. Machine Learning (1995)
- OETMAP: a new feature encoding scheme for MHC class I binding prediction. Mol. Cell. Biochem.
- Analysis of peptide-protein binding using amino acid descriptors: prediction and experimental verification for human histocompatibility complex HLA-A0201. J. Med. Chem.
- Clustering Algorithms
- Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U. S. A.
- Software tools for simultaneous data visualization and T cell epitopes and disorder prediction in proteins. J. Biomed. Inform.
- The rule based classification models for MHC binding prediction and identification of the most relevant physicochemical properties for the individual allele. Univ. Thought – Publ. Nat. Sci.
- A probabilistic analysis of the rocchio algorithm with tfidf for text categorization. International Conference on Machine Learning (ICML)
- Text categorization with Support Vector Machines: learning with many relevant features