Introduction

MHC class I and II antigens are of central importance to the immune system. MHC molecules (known as HLA molecules in humans) mediate the key interactions between T cells and other cells of the body. First, in the cytoplasm, antigenic peptides are bound in an extended conformation within the grooves of MHC molecules, which feature pockets into which anchoring peptide side chains can fit [1]. Second, MHC molecules present peptides to T helper lymphocytes and cytotoxic T lymphocytes (CTL) on the cell surface. Presented peptides that are recognized by CTL trigger an immune response and are termed T-cell epitopes. In this way, virally infected cells, pathologically mutated cells, and tumor cells are discriminated from healthy cells. The activation of CTL requires presentation of endogenous antigenic peptides by MHC class I molecules [2]. Identification of epitopes and peptides that can bind MHC molecules supports the design of peptide-based vaccines and immunotherapy [3]. MHC/peptide binding that initiates an immune response occurs at a rate of 0.1–5% for any given protein, of which some 20% remain functionally relevant [4]. Hence, computational prediction of MHC/peptide binding can save experimental effort and time.

In the prediction of MHC specificity, both sequence- and structure-based methods have been used for classification. If sufficient experimental data are available, sequence-based methods are more efficient than structure-based methods. The core binding motif of both MHC I and II is approximately nine amino acids long [5]. Therefore, the specificity of an MHC I molecule can be analyzed from a set of 9-mer peptides known to bind to a given allele.

Several data stores describe the binding specificities of MHC molecules. The Immune Epitope Database and Analysis Resource (IEDB) [6] and the SYFPEITHI database [7] are the main data repositories and services. MHCPEP [8] and MHCBN [9] are other widely used data stores for MHC alleles. However, IEDB has more entries than the others and is comparatively more up to date.

Traditional feature encoding is primarily based on the amino acid composition model. However, amino acid composition alone discards part of the information in the protein sequence, and information about the sequence order effect cannot be easily incorporated into a pattern recognition model [10].

In this article, eight encoding schemes are evaluated for predicting MHC/peptide binding. The first is orthonormal encoding (OE), a common encoding technique. According to OE, each amino acid symbol \( P_{i} \) in a peptide is replaced by an orthonormal vector \( d_{i} = (\delta_{i1} ,\delta_{i2} , \ldots ,\delta_{i20} ) \) where \( \delta_{ij} \) is the Kronecker delta symbol. Each \( P_{i} \) is thus represented by a 20-bit vector in which 19 bits are set to zero and 1 bit is set to one, according to the alphabetic order of amino acids. The vector of each amino acid is orthogonal to the vectors of all other amino acids, and \( P_{i} \) can be any one of the twenty amino acids [11]. Each nonamer is thereby represented by a vector of 180 bits. The main drawback of OE is that its binary feature vectors carry no information about the similarity between amino acids, which results in information loss.
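The OE mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names `one_hot` and `encode_oe` are hypothetical, and the residue order is assumed to be the one listed in the proposed-encoding section of the Methods (A, R, N, D, ...).

```python
# Residue order assumed from the list given in the Methods section.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def one_hot(residue):
    """Return the 20-bit OE vector of a single residue (hypothetical helper)."""
    if residue not in AMINO_ACIDS:
        raise ValueError("unknown amino acid: %r" % residue)
    # Exactly one bit is set: the one at the residue's position in the order.
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]


def encode_oe(peptide):
    """Concatenate the OE vectors of all residues of a peptide."""
    vector = []
    for residue in peptide:
        vector.extend(one_hot(residue))
    return vector


if __name__ == "__main__":
    encoded = encode_oe("ALDFEQEMT")
    print(len(encoded), sum(encoded))
```

For a nonamer the result has 9 × 20 = 180 bits, nine of which are set to one.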

Another common approach is the frequency-based method. In this method, the weight of each amino acid \( P_{i} \) in a peptide is determined and then combined with OE: the vector \( d_{i} \) is multiplied by the weight of amino acid \( P_{i} \) [12]. The frequency-based technique preserves the original number of attributes.

Amino acids that are frequently substituted for each other over time in homologous sequences are regarded as similar, and these relationships are captured by substitution matrices such as BLOSUM50 and BLOSUM62 [13]. In [14], the authors described an encoding scheme named BLOMAP, which uses a non-linear projection method to capture the similarity information in the BLOSUM62 matrix. BLOMAP is an improved variant of the Sammon projection mapping.

Another encoding scheme is n-grams, or k-tuples [15]. Each feature is described by a pair of values \( (v_{i} ,c_{i} ) \), where \( v_{i} \) is the feature and \( c_{i} \) is the count of this feature in a protein sequence. The features are all the possible combinations of n amino acids drawn from the 20 amino acids.
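As a rough illustration of the \( (v_{i} ,c_{i} ) \) pairs, the sketch below counts overlapping n-grams in a sequence. It assumes that the "possible combinations" are ordered tuples (so \( 20^{n} \) features) and that n-grams are read with a sliding window of step one; neither detail is specified above. The function name `ngram_features` is hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def ngram_features(sequence, n=2):
    """Return a list of (feature, count) pairs for every possible n-gram.

    Assumption: n-grams are ordered and counted with an overlapping
    sliding window, which gives 20**n features per sequence.
    """
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    features = []
    for letters in product(AMINO_ACIDS, repeat=n):
        gram = "".join(letters)
        features.append((gram, counts.get(gram, 0)))
    return features


if __name__ == "__main__":
    pairs = ngram_features("ALDFEQEMT", n=2)
    print(len(pairs))
    print([p for p in pairs if p[1] > 0])
```

For n = 2 each sequence yields 400 features, most of which are zero for a nonamer, since a 9-mer contains only eight overlapping 2-grams.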

Zvelebil et al. [16] proposed an encoding method based on Taylor’s Venn-diagram (TVD) [17], which describes the membership of an amino acid in each of ten classes as a binary vector. The Zvelebil encoding technique exploits the physicochemical properties of amino acids without high dimensionality.

In [18], the authors, inspired by Chou’s quasi-sequence-order model and Yuan’s Markov chain model, developed the residue-couple (RC) encoding technique. The RC model takes into account not only consecutive amino acid pairs but also gapped amino acid pairs.

In [19], four sequence-based approaches (DynaPredPOS, NetMHC, SVMHC, and YKW) were evaluated for predicting peptide binding to MHC class I molecules.

The DynaPredPOS prediction method uses two feature matrices derived from structural calculations as the basis for support vector machine training: a local, position-dependent matrix (DynaPredPOS) and a global, position-independent matrix (DynaPred) [20]. SVMHC is an SVM-based prediction technique whose kernels were optimized by systematic variation of the parameters [21]. The YKW method is based on data-derived matrices [22]. The predictions of the NetMHC method (Footnote 1) are based on artificial neural networks trained on data from 55 MHC alleles (43 human and 12 non-human) and on position-specific scoring matrices for an additional 67 HLA alleles [23]. NetMHC is the state-of-the-art predictive model for MHC/peptide binding [19]. The prediction performance of these methods was evaluated on three up-to-date datasets.

In this article, we investigate a new feature encoding method that combines the sequence order of the residue composition, based on OE, with a representation of the relationships among residues, based on TVD. This encoding scheme, termed OETMAP, has been applied with LSVM to MHC binding prediction. The computational results demonstrate that OETMAP achieves higher performance than the other feature encoding methods re-implemented with a standalone classifier. Having compared the performance of the eight encoding methods, we conducted a further comparison between OETMAP and four established MHC class I prediction methods, namely DynaPredPOS, NetMHC, SVMHC, and YKW, as implemented in [19].

Methods

Datasets

We conducted our tests on three up-to-date datasets (Footnote 2), Full (F), Intermediate (I), and Strong (S), each composed of sets of 9-mer peptide sequences for a given allele. Dataset F includes all available binders and non-binders in IEDB; dataset I includes only weak binders (50–500 nM binding affinity) and non-binders (500–1000 nM binding affinity); and dataset S includes only strong binders (less than 10 nM binding affinity) and very clear non-binders (greater than 10,000 nM binding affinity), as outlined in [18].
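The affinity ranges above can be expressed as a small labeling rule. The sketch below is illustrative only: the function name `label_peptide` is hypothetical, the handling of the boundary values (for example, assigning exactly 500 nM to the non-binders of dataset I) is an assumption, and dataset F is left out because no numeric criterion is stated for it.

```python
def label_peptide(affinity_nm, dataset):
    """Label a peptide for dataset "I" or "S" from its binding affinity in nM.

    Returns "binder", "non-binder", or None when the peptide falls outside
    the ranges of the requested dataset. Boundary handling is an assumption.
    """
    if dataset == "S":
        if affinity_nm < 10:
            return "binder"          # strong binder
        if affinity_nm > 10000:
            return "non-binder"      # very clear non-binder
        return None
    if dataset == "I":
        if 50 <= affinity_nm < 500:
            return "binder"          # weak binder
        if 500 <= affinity_nm <= 1000:
            return "non-binder"
        return None
    raise ValueError("only datasets 'I' and 'S' have stated affinity ranges")


if __name__ == "__main__":
    for value in (5, 100, 800, 20000):
        print(value, label_peptide(value, "I"), label_peptide(value, "S"))
```

Under this rule a peptide can belong to at most one of the two datasets, since the ranges of I (50–1000 nM) and S (below 10 nM or above 10,000 nM) do not overlap.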

Support vector machines

SVM is an effective discriminative classification method from statistical learning theory and has recently been applied successfully by a number of researchers. SVM aims to find the maximum margin hyperplane that separates two classes of patterns. A nonlinear transform that maps the data into a higher dimensional space allows a linear separation of classes that could not be linearly separated in the original space. The objects located on the two margin hyperplanes, which lie parallel to the separating hyperplane on either side of it, are the so-called support vectors. The maximum margin hyperplane, which is uniquely defined by the support vectors, gives the best separation between the classes [24]. In the tests, the LSVM algorithm was applied using the OSU Toolbox [25].
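As a reference point, the maximum margin idea described above can be written in its textbook hard-margin linear form. The notation is an assumption introduced for illustration and is not taken from the original: \( \vec{\chi }_{k} \) denotes the feature vector of the kth training peptide, \( t_{k} \in \{ - 1, + 1\} \) its class label, \( \vec{w} \) the normal vector of the hyperplane, \( b \) its offset, and N the number of training peptides. Whether the LSVM implementation used in the tests solves exactly this problem is not stated here.

$$ \min_{\vec{w},b} \; \frac{1}{2}\left\| {\vec{w}} \right\|^{2} \quad {\text{subject to}}\quad t_{k} \left( {\vec{w} \cdot \vec{\chi }_{k} + b} \right) \ge 1,\quad k = 1, \ldots ,N $$

In this form the two margin hyperplanes are \( \vec{w} \cdot \vec{\chi } + b = \pm 1 \), the distance between them is \( 2/\left\| {\vec{w}} \right\| \), and the support vectors are the training points for which the constraint holds with equality.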

Proposed encoding scheme

OE is a common method for the representation of sequences. It ensures that all the binary vectors obtained from amino acid sequences are mutually orthogonal and of unit length [11]. However, OE carries no knowledge of sequence homology or of the properties of proteins. We believe that co-localized amino acids must share some similarity in terms of their physicochemical properties. TVD shows the relationship of the 20 naturally occurring amino acids to a selection of physicochemical properties that are important in determining protein tertiary structure. The TVD shown in Fig. 1 is based on a 2-D arrangement derived from Dayhoff’s mutation matrix. Amino acids were then displaced from this arrangement to form groups of residues related by common physicochemical properties [17]. We believe that the physicochemical relationships captured by TVD are informative for the prediction of MHC class I binding; that is, amino acids of similar physicochemical type tend to occur at a given position of binding peptides. However, TVD alone is not sufficient for representing amino acids. Consequently, OETMAP, the scheme we developed, combines the OE and TVD methods, which are complementary to each other.

Fig. 1 TVD; amino acids are classified into ten classes that synthesize physicochemical properties and mutation data (Adapted from Ref. [14])

Let P be an amino acid sequence in the MHC binding dataset, and let \( P_{i} \) be the ith amino acid in P (for i = 1, 2, …, L, where L is the length of the amino acid sequence).

Corresponding to each \( P_{i} \), we build feature vectors \( \{ \vec{y}\}_{i}^{1} \) and \( \{ \vec{y}\}_{i}^{2} \) as follows: \( \{ \vec{y}\}_{i}^{1} \) is the OE vector (20-bit) for \( P_{i} \). The position of the bit set to one identifies which of the 20 amino acids \( P_{i} \) is, in the order A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V.

\( \{ \vec{y}\}_{i}^{2} \) is a binary feature vector composed of ten bits obtained from TVD, shown in Table 1.

Table 1 Binary code vectors for \( \{ \vec{y}\}^{2} \)

The feature vectors \( \{ \vec{y}\}_{i}^{1} \) and \( \{ \vec{y}\}_{i}^{2} \) built for each \( P_{i} \) are concatenated:

$$ \{ \vec{y}\}_{i} = \left( {\{ \vec{y}\}_{i}^{1} \parallel \{ \vec{y}\}_{i}^{2} } \right) $$

Finally, the feature vector \( \vec{\chi } \) of P, which has a dimension of \( 30 \times L \), is obtained by concatenating these vectors in sequence order:

$$ \vec{\chi } = \left( {\{ \vec{y}\}_{1} \parallel \{ \vec{y}\}_{2} \parallel \cdot \cdot \cdot \parallel \{ \vec{y}\}_{L} } \right) $$

To illustrate the new encoding, we work through an example for the sequence ALDFEQEMT from the MHC binding dataset. For residue D, the third residue of the peptide sequence, the OE mapping is computed as follows:

$$ \{ \vec{y}\}_{3}^{1} = \, \left[ {0 \, 0 \, 0{ 1 }0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0} \right] $$

Next, we have computed \( \{ \vec{y}\}_{3}^{2} \):

$$ \{ \vec{y}\}_{3}^{2} = \left[ {0 \, 0 \, 1 \, 1 \, 1 \, 1 \, 0 \, 0 \, 0 \, 0} \right] $$

\( \{ \vec{y}\}_{3} \) is then computed as:

$$ \{ \vec{y}\}_{3} = \left[ {0 \, 0 \, 0{ 1 }0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0 \, 0{ 1 1 1 1 }0 \, 0 \, 0 \, 0} \right] $$

These computation procedures are repeated for each amino acid of the peptide sequence to obtain the feature vector \( \vec{\chi } \).
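The concatenation procedure can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and the TVD lookup table is passed in as a parameter because only the ten-bit code for residue D appears in the worked example. The remaining 19 codes would have to be filled in from Table 1.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # order listed for the OE vector

# Only the code for D is given in the worked example; the other entries
# must be taken from Table 1 and are deliberately not guessed here.
TVD_CODES = {"D": [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]}


def oetmap_residue(residue, tvd_codes):
    """Return the 30-bit vector of one residue: 20 OE bits, then 10 TVD bits."""
    if residue not in AMINO_ACIDS:
        raise ValueError("unknown amino acid: %r" % residue)
    if residue not in tvd_codes:
        raise KeyError("no TVD code supplied for %r" % residue)
    tvd = list(tvd_codes[residue])
    if len(tvd) != 10:
        raise ValueError("TVD code must have ten bits")
    oe = [1 if aa == residue else 0 for aa in AMINO_ACIDS]
    return oe + tvd


def oetmap_encode(peptide, tvd_codes):
    """Concatenate the per-residue vectors into one vector of length 30 * L."""
    chi = []
    for residue in peptide:
        chi.extend(oetmap_residue(residue, tvd_codes))
    return chi


if __name__ == "__main__":
    print(oetmap_residue("D", TVD_CODES))
```

With a complete table, `oetmap_encode("ALDFEQEMT", table)` would return a vector of 270 bits for the nonamer used in the example.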

Results

Performance of feature representation method

A 10-fold cross-validation (10-fold CV) testing protocol is applied to evaluate the performance of the methods in terms of the area under the receiver operating characteristic (ROC) curve (AUC), averaged over ten experiments on each dataset. In a cross-validation run, the ten folds are created randomly [26]. In each round, the methods are trained on 90% of the data and tested on the remaining 10%. This process is repeated ten times so that each peptide in the datasets is used for testing exactly once; the folds used for training never overlap with the fold used for testing in the same round. The average AUC of each method over these ten rounds is then obtained. The performance of the feature encoding methods on datasets F, I, and S is shown in Table 2, and the comparison of OETMAP with other prediction models is shown in Table 3. The AUC is the area under the ROC curve, which plots the true positive rate as a function of the false positive rate for varying classification thresholds and thus describes the performance of a model across the entire range of classification thresholds [27]. Alongside OETMAP, the following methods have been evaluated: OE, OE combined with the frequency-based method (OE + Freq.), OE combined with the substitution matrix BLOSUM50 (OE + B50), 2-grams (2g), RC, BLOMAP, and TVD.
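The evaluation protocol can be sketched with the standard library alone. The code below is a hedged illustration, not the original MATLAB implementation: the function names `k_fold_indices` and `auc` are hypothetical, the fixed random seed is an assumption, and the folds are not stratified by class because the text does not say whether they were. The AUC is computed through its rank interpretation (the probability that a randomly chosen binder scores higher than a randomly chosen non-binder, counting ties as one half), which equals the area under the ROC curve.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k randomly created folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = binder, 0 = non-binder)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute the AUC")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    for train, test in k_fold_indices(25, k=10):
        print(len(train), len(test))
    print(auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))
```

In a full experiment, a classifier would be trained on the `train` indices of each round, scored on the `test` indices, and the ten per-round AUC values averaged.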

Table 2 Prediction performance of amino acid encoding schemes according to AUC values on datasets
Table 3 OETMAP is compared with other prediction models according to AUC performance on three datasets

Table 2 shows that OETMAP outperforms the competing encoding techniques on dataset F, with an AUC value of 0.87. Note that n-grams obtained the worst performance. The predictions on dataset I were poor (the highest average AUC value achieved was 0.59), which indicates that intermediate binders were difficult to classify. Table 2 also shows that TVD achieved the best results on dataset I. However, once again the n-grams and RC methods obtained the worst performance, as on dataset F. Dataset S contains only clear-cut 9-mer peptides (i.e., strong binders and clear non-binders), and therefore the best performance was obtained on dataset S. OETMAP achieved the best result on dataset S, with an AUC value of 0.951. In terms of average values, OETMAP achieved the highest performance among the encoding schemes, with a value of 0.801. We note that OETMAP combines the strengths of both OE and TVD.

Table 3 shows that NetMHC is the best among the five predictive models, particularly on dataset S. This may be because some of the data used to train NetMHC are probably identical to the data extracted from IEDB for this study, as NetMHC was only accessible via a web interface [19]. YKW and OETMAP followed NetMHC on dataset I and dataset F, respectively. DynaPredPOS, SVMHC, YKW, and OETMAP were tied on dataset S.

Conclusion

In this article, we have studied the problem of determining whether a given nonamer peptide binds to a given MHC allele, by means of a new encoding scheme. The experimental results on three up-to-date MHC class I datasets show that the new encoding scheme can accurately predict MHC/peptide binding with high sensitivity using a standalone classification algorithm (LSVM). The proposed method can also be used with other machine learning methods and for other kinds of peptide classification problems. Because independent and accurate classifiers make errors in different regions of the feature space, they can be combined into an ensemble. Hence, future work will involve ensembles of classifiers with the OETMAP encoding scheme.

Reproducibility material

Some of the MATLAB code used to obtain the empirical results in this article is available at: http://www.sakarya.edu.tr/aozcerit/codeMHC.rar.