Keywords

1 Introduction

T-cells [1] are specialized immune cells, playing a crucial role in activation of the adaptive immune system. Once the HLA II proteins have established a stable binding with the exogenous peptide antigens (T-cell epitopes), they are transported on the extracellular domain of the antigen-presenting cell (APCs). The T-cell receptors (TCRs) located on the surface of the T-cells, interacts with the HLA II-peptide constituting trimeric complexes responsible for the activation of the T-cell CD4+. HLA II proteins are encoded in three different genetic loci: HLA-DP, HLA-DQ and HLA-DR (also called isotypes) and are constituted of two separate protein chains: \(\alpha \) and \(\beta \) [2]. They contain an open-ended binding cleft which allows the peptides to accommodate using multiple binding frames [3]. Thus the complexity in binding prediction problem is significantly increased.

In this regard, different computational techniques have been developed to predict HLA class II binding. Among them sequence-based and structure-based approaches are the most popular. Sequence-based methods include matrix models [4], binding motif recognition [5], artificial neural network [6], quantitative matrices [7], hidden markov models [8], support vector machines [9] and QSAR [10] based methods. Structure-based methods involve threading algorithms [11], peptide docking [12] and molecular dynamics [13]. Sophisticated methods, such as an iterative meta-search algorithm [14] and ant colony search [15] have been developed to resolve the dynamic variable length problem of HLA class II proteins prediction. Apart from this, some of the recent approaches has also significantly outperformed more traditional methods [16–18].

In this article, sequence-based information is used to predict the HLA II isotype binding propensity of peptides. For this purpose, new multiclass training and testing datasets of HLA II proteins are prepared from raw binding affinity interaction dataset of 27 HLA II proteins and 636 peptides. Thereafter, an ensemble based multiclass classifier, called Meta EnsembleR (MaER) is developed for the classification of peptides into different HLA II isotypes. The development of an ensemble based approach is basically motivated by the improving performance of similar other methods in bioinformatics [18–23]. The MaER classifier is build on the top of widely used four heterogeneous multiclass classifiers like Support Vector Machine (SVM) [24], Decision Tree (DT) [25], Naive Bayes (NB) [26] and \(K\)-Nearest Neighbor (K-NN) [27]. Before the classification task, MaER splits the training dataset into different number of rotational non-overlapping subsets. Subsequently, bootstrap sampling and principal component analysis are used for each subset. All the major principal components (in terms of coefficient and depending on eigenvalues) for all subsets are retained to create an informative set that preserve the diversity of the original training data. After that, such informative set is multiplied with original training and testing datasets before being classified. In MaER, finally a consensus of ensemble results is produced and for this purpose, selection of classifier is done randomly for different values of ensemble size. The performance of the MaER is reported in comparison with the individual classifies like SVM, DT, NB and \(K\)-NN in terms of average accuracy, precision or positive predictive value (PPV), recall, F-measure, Matthews correlation coefficient (MCC) and area under the ROC curve (AUC) values. Finally, Friedman test [28] has been conducted to judge the statistical significance of the results produced by MaER.

Fig. 1.
figure 1

Block diagrams of (a) Dataset Preparation and (b) MaER Algorithm

Table 1. Statistics of dataset used for MaER

2 Materials and Methods

2.1 Preparation of Dataset

The dataset contains IC50 binding-affinity values originally measured between 27 HLA II proteins and 636 peptides derived from Phleumpratense [29]. The raw affinity dataset is transformed into a binary binding matrix defining binding and non-binding events solely. The IC50 value of 1000 nM is used as a threshold, since it represents the common reference to define HLA II-peptide binding events in literature [20, 29]. Thereafter, the percentage of positive activity (PPA) is separately computed for each isotype \(\phi \) (DR, DQ and DP) from the binary binding matrix. Initially, the highest number of HLA proteins capable of binding a single peptide \(x\), is used as a reference, e.g., \(\rho (DR)\), within each isotype. Consequently, three (one for each isotype) PPAs values are calculated for each given peptide. The PPA value for a given peptide, and for a given isotype is simply defined as the proportion of positive values that this peptide has for that particular isotype, with respect to the reference value for the same isotype, e.g., \(\rho (DR)\).

$$\begin{aligned} PPA(x_{i})_{\phi } = \frac{\sum _{j=1}^{{|\phi |}}\textit{a}_{i,j}}{\rho ({\phi })}~~~~~~ \forall ~ i \le |X|, \forall {\phi } \in \varPhi ~~~~~~~~~~ \end{aligned}$$
(1)

where

$$\begin{aligned} {\rho (\phi )} = \max _{1<i<|X|} \sum _{j=1}^{{|\phi |}}\textit{a}_{i,j}~~~~~~ \forall ~ i \le |X|,~~ and~~ {\varPhi } = {\{DR, DQ, DP\}} \end{aligned}$$
(2)

Each peptide \({x}\), does now have three PPAs: one for each isotype \(\phi \). The maximum \({max}_{1}({x})\) and second highest maximum \({max}_{2}({x})\) PPA values are identified for each peptide. The difference value \(\varDelta {x} = {max}_{1}({x}) - {max}_{2}({x})\) is used to determine if the given peptide has either strong or weak binding properties, with respect to a given threshold \({k}\). If \(\varDelta {x} \ge {k}\) then the corresponding peptide is consider a strong binder for the class \(\phi (x) \mapsto {max}_{1}({x})\) otherwise it is considered as a weak binder for the classes \((\phi _{1}(x), \phi _{2}(x)) \mapsto ({max}_{1}({x}), {max}_{2}({x}))\).

The peptide classification into multiple isotype is defined with respect to the threshold value \(k\). As the lower \(k\) value increases the number of peptides, defined as strong HLA binders. The statistics about the effect of variation of \(k\) both in terms of isotype ratio and percentage of strong binders, are given in Table 1 and has been taken into account for the choice of \(k\). Ultimately, the value of \(k\) has been set to 0.15, in order to either maintain a comparable ratio among the isotypes, and grant a equal definition of strong and weak binding peptides.

Since the classification technique requires a common number of features for each peptide, a common length of 15 AAs is adopted. In the homogenized dataset, the edging AAs of peptides longer than 15 AA are shorted. The dissection is performed upon an accurate comparative analysis of the less conserved residues within the original peptides. In order to represent the entire pool of 636 peptides in a numerical form, a 40 high-quality AA indices (HQI40) [20, 30, 31] are used. Therefore, the length of the peptide sequence is 15\(\times \)40=600. The block diagram representation of the experiment with data generation is given in Fig. 1(a).

2.2 The Proposed MetaEnsembleR

Meta EnsembleR (MaER) is an ensemble based classifier, where four heterogeneous classifiers like support vector machine, naive bayes, decision tree and \(K\)-nearest neighbor are used. It creates a diverse set of training points by preparing different non-overlapping sets of features. In order to discuss MaER, some notations are introduced here. Let \(\mathcal {P}\) is a matrix of size \(n \times M\), which consists of \(M\) input attributes or features values for each training instance and \(\mathcal {Q}\) be an one dimensional column vector contains the output attribute of each training instance in \(\hbar \). Therefore, \(\hbar \) can be expressed as after concatenating \(\mathcal {P}\) and \(\mathcal {Q}\) horizontally, i.e., \(\hbar = [\mathcal {P} \mathcal {Q}]\). Also let \(\mathcal {F}\) = \(\lbrace \mathcal {P}_1, \mathcal {P}_2,\ldots , \mathcal {P}_M \rbrace \) and \(\mathcal {T}\) are the set of features (\(M\ge 4\)) and ensemble size. Therefore, it can be assumed that a training set of \(n\) labelled instances \(\hbar =\lbrace p_j, q_j\rbrace _{j=1}^n\) in which each instance \((p_j,q_j)\) is described by \(M\) input attributes or features and an output attribute, i.e., \(p \in \mathbb {R}^n\) and \(q \in \mathbb {R}\) where \(q\) takes a value from the label space \(\lbrace L_1,L_2,\ldots ,L_c \rbrace \). In a classification task, the goal is to use the information only from \(\hbar \) to construct a classifier which can perform well on the unseen data. Note that in MaER, the feature set, \(\mathcal {F}\) = \(\lbrace \mathcal {P}_1, \mathcal {P}_2,\ldots , \mathcal {P}_M \rbrace \), splits into \(\mathcal {S}\) number of feature subsets, where \(\mathcal {S} \in [ 2, \lfloor \frac{M}{2} \rfloor ]\). Also from the pool of classifiers, one classifier is randomly selected for the each value of \(\mathcal {T}\). In order to construct the training and testing datasets for a classifier in ensemble, the following necessary steps are performed.

  • Step1: Randomly split \(\mathcal {F}\) into \(\mathcal {S}\) number of subsets, i.e.,\(\mathcal {S}_{s,t}\) for simplicity, where \(t\) counts the ensemble size and \(s\) signifies the current attribute or feature subset. As \(\mathcal {S} \in [ 2, \lfloor \frac{M}{2} \rfloor ]\), therefore, the minimum number of subsets is 2 with at least 2 features in each subset is considered.

  • Step2: Repeat the following steps \(\mathcal {S}\) times for each subset, i.e., \(s = 1, 2,\ldots ,\mathcal {S}\).

    • (a) A new submatrix \(\mathcal {P}_{s,t}\) is constructed which corresponds to the data in matrix \(\mathcal {P}\).

    • (b) From this new submatrix a bootstrap sample \(\mathcal {P}^{\prime }_{s,t}\) is drawn where the sample size is generally smaller than \(\mathcal {P}_{s,t}\).

    • (c) Thereafter, \(\mathcal {P}^{\prime }_{s,t}\) is used for PCA and the coefficients of all computed principal components are stored into a new matrix \(\mathcal {C}_{t,s}\).

  • Step3: In order to have a matrix of same size of feature, arrange each \(\mathcal {C}_{t,s}\) into a block diagonal sparse matrix \(\mathcal {D}_t\). Once the coefficients in \(\mathcal {C}_{t,s}\) are placed in to the block diagonal sparse matrix \(\mathcal {D}_t\), the rows of \(\mathcal {D}_t\) are rearranged so that the order of them corresponds to the original attributes in \(\mathcal {F}\). During this rearrangement, columns with all zero values are removed from the sparse matrix.

  • Step4: The rearranged rotation matrix \(\mathcal {D}^r_t\) is then used as \([\mathcal {P}\mathcal {D}^r_t; \mathcal {Q}]\) and \([\mathcal {I}\mathcal {D}^r_t]\) for training and test sets of classifier, where \(\mathcal {I}\) is a given test sample.

  • Step5: In the testing phase, let \( MaER _{v,t}(\mathcal {I}\mathcal {D}^r_t)\) be the posterior probability produced by the classifier \( MaER _t\) on the hypothesis that \(\mathcal {I}\) belongs to class \(L_v\). Then the confidence for a class is calculated by the average posterior probability of combination base classifiers:

    $$\begin{aligned} \mathcal {Q}_v (\mathcal {I})=\frac{1}{\mathcal {T}} \sum _{t=1}^\mathcal {T} MaER _{v,t} (\mathcal {I}\mathcal {D}^r_t),~~ where ~~v=1,2,\ldots ,c \end{aligned}$$
    (3)

    Thereafter, \(\mathcal {I}\) is assigned to the class with the largest confidence. Note that all the five steps will repeat for \(t= 1,2,\ldots ,\mathcal {T}\).

This is to be noted that due to the process of random feature subdivision, in each iteration, the selected classifier in MaER will have new sets of training and testing data, which will help to diversify the ensemble of classifiers in order to get better classification results. The MaER is applied to predict the multiclass binding activity of HLA class II protein, i.e., DP, DQ and DR at a time. For this purpose, based on threshold value \(k\)=0.15, multiclass dataset of strong binding peptides is used to train the MaER to predict the classes of weak binding peptides. Note that here the train and test datasets are normalized, where each input data is normalized to the range [0,1]. The flowchart of MaER is shown in Fig. 1(b).

3 Results and Discussions

The MaER contains four basic classifiers like SVM, ND, DT and \(K\)-NN. Therefore, setting the parameters of these classifiers is equivalent to set the parameters of MaER. Thus the parameters of SVM such as \(\gamma \) for kernel function and the soft margin \(\mathcal {C}\) (cost parameter), are set to be \(0.5\) and \(2.0\), respectively. Note that, for SVM and DT, RBF (Radial Basis Function) kernel and C4.5 classifier are used. The value of \(\textit{K}\) for the \(\textit{K}\)-NN classifier, ensemble size \(\mathcal {T}\) and number of feature subsets \(\mathcal {S}\) are set to 7, 10 and 50, experimentally. The MaER classifier has been executed 20 times and the performance of MaER classifier is evaluated in term of average accuracy, precision or PPV, recall or sensitivity, F-measure, MCC and area under the ROC curve (AUC) values. To compute these metrics the result of three class problem has been decomposed into three indivisible two class results by setting class-1 as an active and other classes are inactive, similarly, for class-2 and class-3. Thereafter, the average is computed and that also used to computer the average of 20 runs. Moreover, the effectiveness of MaER results has also been justified in terms of gain value. The gain is calculated in percentage as follows.

$$\begin{aligned} \begin{aligned} Gain = \frac{\textit{(MaER Predicted Accuracy - Referred Classifier Accuracy)}}{\textit{(Referred Classifier Accuracy)}}\times 100~ \end{aligned} \end{aligned}$$
(4)
Table 2. Performance comparison of MaER based HLA II-peptide binding predictor with other classifiers in terms of average Accuracy, Precision, Recall, F-measure, MCC and AUC values
Fig. 2.
figure 2

Best ROC plots of HLA class II protein-peptide binding prediction for (i) DP, (ii) DQ and (iii) DR

Table 2 reports the performance of all classifiers including MaER on HLA II-peptide binding prediction. From the results, it can be clearly stated that the MaER performing better than the other classifiers. Moreover, the result is also suggesting that it can be used as a potentially computation tool for discovering multiclass HLA class II binding epitopes that has a great importance in vaccinology. The best ROC curves of all the classifiers are shown in Fig. 2. The curves produced by MaER for DP, DQ and DR, are showing the average AUC values like 0.92, 0.93 and 0.91. Moreover. The MaER classifier achieves the average gain values of 2.18 %, 15.45 %, 29.25 % and 2.81 % over SVM, DT, NB and K-NN classifiers, respectively. Finally, Friedman test based has been conducted average accuracy values of all classifiers to judge the statistical significance of the predicted results. The test produced a Chi-square value of 86.81 and p-value of 0.126\(\times 10^{-5}\) at \(\alpha \)= 0.05 significance level. The results provide a strong evident in order to rejecting the null hypothesis, that means there is a significant difference in the results produced by various classifies, while MaER produced the best results among them.

4 Conclusions

In this article, the binding prediction of HLA II isotypes is considered as a multiclass problem. For this purpose, new multiclass training and testing datasets of HLA II proteins have been prepared after analysing raw binding affinity interaction dataset of 27 HLA II proteins and 636 peptides. Thereafter, an ensemble based multiclass classifier, named as Meta EnsembleR (MaER) has been developed for the same problem. MaER is an ensemble based classifier which avoids the weakness of a single classifier while improving the prediction performance by integrating the outputs of multiple heterogeneous classifiers. It generally pre-processes the original training and testing datasets by making feature subsets, bootstrap samples and creates diverse datasets using principle component analysis. The efficacy of the developed MaER has been demonstrated in comparison with support vector machine, decision tree, naive bayes and \(K\)-nearest neighbor on newly generated test data in terms of average accuracy, precision, recall, F-measure, MCC, area under the ROC curve (AUC) and gain values. It is observed that MaER achieves the maximum gain of \(29.25\,\%\) over Naive Bayes classifier. Finally, the statistical significance of the results produced by MaER has been justified by the Friedman test.

The application of the MaER method could be particularly beneficial wherever the informations about HLA isotype propensity and coverage are crucial. One example is the design of peptide-based vaccines, where the identification of the epitopes able to interact with different HLA isotypes, is a crucial factor for vaccine efficacy and population coverage. Another case is the study of autoimmune diseases, where the detection of self epitopes showing extensive cross reactivity with several HLA isotypes, is indeed a central issue. Apart from this, MaER can be used to find the potential markers from gene expression data [32, 33].