Introduction

B cells, also known as B lymphocytes, play a central role in the mammalian immune response. In response to antigenic stimulation by bacteria and viruses, they differentiate into plasma cells that produce antibodies to combat infection1. The fragment of the antigen molecule that specifically binds to the B-cell surface receptor or antibody is called the B-cell epitope (BCE)2. B-cell epitopes can be either consecutive amino acid residues in the antigen protein sequence, known as linear epitopes, or discontinuous residues that are brought into spatial proximity when the protein folds into its three-dimensional conformation, known as conformational epitopes3,4.

The identification of BCEs has greatly contributed to the development of biomedicine, for example, to the overall understanding of immune response mechanisms and to the design and development of vaccines5,6. Experimental methods for identifying BCEs include X-ray crystallography, cryo-EM, nuclear magnetic resonance, hydrogen–deuterium exchange coupled to mass spectrometry, and peptide-based approaches7. However, these methods are generally expensive, time-consuming, and labour-intensive8. The volume of biological data has grown rapidly in recent years, and traditional experimental methods can no longer keep pace. There is therefore a need for sequence-based computational methods that identify potential BCEs quickly and accurately9. Various computational methods have been published for predicting conformational or linear BCEs. Although about 90% of BCEs are conformational epitopes and only 10% are linear, linear BCEs have remained a focus of research in recent years10,11.

Early BCE prediction methods, such as Antigenic12, PEOPLE13, BEPITOPE14, and BcePred15, used only the physicochemical properties of amino acids. With the development of computer technology, researchers combined multiple physicochemical properties of proteins with machine learning techniques to construct new prediction methods. For example, the BepiPred16 predictor uses a combined approach that incorporates an amino acid propensity scale and hidden Markov models (HMM). Chen et al.17 refined the single amino acid propensity scale by creating the Amino Acid Pair (AAP) antigenicity scale; experiments with a support vector machine (SVM) classifier showed that the AAP antigenicity scale outperformed other methods in BCE prediction, and the BCE predictor AAPPred was subsequently developed18. Among the machine learning models used to predict BCEs, SVM has been the dominant choice, and many researchers have trained SVM-based BCE predictors on different datasets combining a variety of amino acid features, such as BCPred19, FBCPred20, COBEpro21, SVMTrip22, LBtope23, IgPred24, APCpred25, and EpitopeVec26. Other machine learning models have also been used: BepiPred-2.04 and iLBE27 use random forest (RF) for epitope classification, while iBCE-EL9 uses extremely randomized trees (ERT) for model training. Sahu et al.28 evaluated 18 encoding methods based on SVM and RF. In addition to traditional machine learning models, deep learning models have been applied to BCE classification. For example, the predictor ABCPred29, developed in 2006, used a recurrent neural network (RNN), while DLBEpitope30 and EpiDope31 used a feedforward neural network (FNN) and a deep neural network (DNN), respectively.

This study proposes a new method, LBCE-BERT, for predicting linear BCEs. The method combines traditional amino acid residue features, sequence features, and semantic embeddings from the BERT (Bidirectional Encoder Representations from Transformers) model used in natural language processing to form an optimal feature set. The effectiveness of the method was validated on multiple datasets.

Materials and method

This study involved collecting datasets, extracting protein sequence information and converting it into matrix form, and inputting the resulting matrices into a machine learning model for training and hyperparameter optimization. The method was then compared with existing approaches. The experimental workflow is shown in Fig. 1.

Figure 1

Overall flow of analysis in the present study. AAC: amino acid composition; AAP: amino acid pair scale; AAT: amino acid trimer scale; BERT: embeddings extracted from Bidirectional Encoder Representations from Transformers; Pre: precision; Sn: sensitivity; ACC: accuracy; MCC: Matthews correlation coefficient; F1: F1 score; AUROC: area under the receiver operating characteristic (ROC) curve.

Benchmark datasets

Most of the datasets used in the BCE predictors developed to date have been obtained from the Immune Epitope Database (IEDB)32 or the Bcipep33 database, in which the data have been experimentally verified. We collected the main benchmark datasets used by existing prediction models. Among them, the early methods ABCPred29, Chen17 and BCPreds19 all collected data from the Bcipep database to construct their respective benchmark datasets, whereas methods proposed subsequently, such as SVMTrip34, iBCE-EL9, BepiPred-2.04, DLBEpitope30 and EpiDope31, used data collected from the IEDB. To ensure a fair and comprehensive comparison with previous methods, we used the benchmark datasets shown in Table 1. The ABCPred, Chen and BCPreds datasets are all from the Bcipep database and have fixed sequence lengths, whereas the LBtope and iBCE-EL data are from the IEDB database and only the LBtope sequences have a fixed length. We therefore chose BCPreds, LBtope and iBCE-EL_training as training datasets and used the remaining datasets as independent test sets.

Table 1 The dataset used in the benchmarking of our method.

BCPreds dataset

The BCPreds dataset was created by El-Manzalawy et al.19. First, sequence data for the B-cell epitopes were collected from the Bcipep database. Amino acid residues were then added or deleted at the ends of the original sequences so that each peptide chain was 20 amino acid residues in length. Next, 80% sequence identity was used as the threshold and duplicate or highly homologous sequences were removed using CD-HIT, leaving 701 epitope sequences as the final positive samples. The same number of non-epitope sequences of the same length were then obtained from Swiss-Prot35 as negative samples.

LBtope dataset

The LBtope dataset was one of the first datasets to obtain BCEs from the IEDB database, with five datasets created by Singh et al.23. In this work we have only used the ‘Lbtope_Fixed_non_redundant’ dataset, which uses the same data processing techniques as the BCPreds dataset.

iBCE-EL dataset

The iBCE-EL dataset was collected from the IEDB database and includes only sequences that have been experimentally confirmed as BCEs in two or more assays. The original sequence lengths are maintained, and CD-HIT with a 70% identity threshold is used to reduce sequence homology. Unlike LBtope, this dataset does not add or delete residues to force a fixed sequence length.

Feature representation of peptides

Amino acid composition

Amino acid composition (AAC)26 is a representation of the frequency of occurrence of each amino acid in a segment of a peptide chain, which can be expressed as

$$\begin{array}{c}{F}_{AAC}=\left({f}_{1},{f}_{2},{f}_{3},\dots ,{f}_{20}\right)\end{array}$$
(1)

where \({f}_{i}=\frac{{R}_{i}}{N}\) \((i=1,2,3,\dots ,20)\) is the proportion of the \(i\)th amino acid in the peptide chain, \({R}_{i}\) is the number of occurrences of the \(i\)th amino acid type, and \(N\) is the length of the peptide chain.
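A minimal sketch of the AAC encoding in Python; the residue ordering and the example peptide are arbitrary choices, not prescribed by the method:

```python
from collections import Counter

# Canonical 20 amino acids; the ordering here is an arbitrary convention.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac_features(peptide: str):
    """Return the 20-dimensional amino acid composition vector f_1..f_20 (Eq. 1)."""
    counts = Counter(peptide.upper())
    n = len(peptide)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

# Example: a hypothetical 20-mer peptide such as those in the BCPreds dataset
print(aac_features("ACDEFGHIKLMNPQRSTVWY"))  # each residue occurs once -> all 0.05
```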

Amino acid pair antigenicity scale

The Amino Acid Pair (AAP) antigenicity scale was first proposed and used by Chen et al.17. The ratio of the frequency of occurrence of each amino acid pair in positive samples to its frequency in negative samples is calculated, and the resulting log ratio is normalised to [−1, 1]. Positive samples were obtained from the Bcipep database and negative samples from the Swiss-Prot36 database, where the negative samples were drawn from all protein sequences in the database excluding known B-cell epitope sequences. The AAP antigenicity scale is calculated as

$$\begin{array}{c}{R}_{AAP}={\text{log}}\left(\frac{{f}_{AAP}^{+}}{{f}_{AAP}^{-}}\right)\end{array}$$
(2)

where \({f}_{AAP}^{+}\) and \({f}_{AAP}^{-}\) are the occurrences of specific dipeptides in BCEs and non-BCEs, respectively.
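A minimal sketch of how such a log-ratio scale could be derived from labelled peptide sets, following Eq. (2); the pseudocount `eps` for unseen pairs and the min-max rescaling details are assumptions rather than the exact procedure of Chen et al.:

```python
import math
from collections import Counter

def kmer_freqs(peptides, k):
    """Relative frequency of every k-mer observed in a set of peptides."""
    counts = Counter()
    for p in peptides:
        for i in range(len(p) - k + 1):
            counts[p[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def antigenicity_scale(positives, negatives, k=2, eps=1e-6):
    """Log-ratio scale of Eq. (2) (k=2) or Eq. (3) (k=3), rescaled to [-1, 1]."""
    f_pos, f_neg = kmer_freqs(positives, k), kmer_freqs(negatives, k)
    raw = {kmer: math.log(f_pos.get(kmer, eps) / f_neg.get(kmer, eps))
           for kmer in set(f_pos) | set(f_neg)}
    lo, hi = min(raw.values()), max(raw.values())
    span = (hi - lo) or 1.0
    return {kmer: 2 * (v - lo) / span - 1 for kmer, v in raw.items()}
```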

Amino acid trimer antigenicity scale

The Amino Acid Trimer (AAT) antigenicity scale was first used in SVMTrip34. It is analogous to the AAP antigenicity scale, except that it is based on amino acid trimers rather than amino acid pairs; the resulting propensity scale is likewise normalised to [−1, 1]. The AAT antigenicity scale is calculated as

$$\begin{array}{c}{R}_{AAT}={\text{log}}\left(\frac{{f}_{AAT}^{+}}{{f}_{AAT}^{-}}\right)\end{array}$$
(3)

where \({f}_{AAT}^{+}\) and \({f}_{AAT}^{-}\) are the occurrences of specific tripeptides in BCEs and non-BCEs, respectively.
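The same sketch with k = 3 yields the AAT scale of Eq. (3). A peptide can then be represented by, for example, the average scale value of its overlapping pairs or trimers; the handling of unseen k-mers (default 0.0) is an assumption for illustration:

```python
def kmer_scale_feature(peptide, scale, k, default=0.0):
    """Average AAP (k=2) or AAT (k=3) antigenicity value over the peptide's k-mers."""
    kmers = [peptide[i:i + k] for i in range(len(peptide) - k + 1)]
    return sum(scale.get(kmer, default) for kmer in kmers) / len(kmers)

# Example usage with the antigenicity_scale sketch above (hypothetical data):
# aap = antigenicity_scale(positive_peptides, negative_peptides, k=2)
# aat = antigenicity_scale(positive_peptides, negative_peptides, k=3)
# score = kmer_scale_feature("ACDEFGHIK", aap, k=2)
```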

Sequence embeddings of BERT

The BERT (Bidirectional Encoder Representations from Transformers) model proposed by Devlin et al.37 has gained significant recognition in the field of natural language processing due to its exceptional performance and has been widely adopted across various domains, including bioinformatics. For example, Qiao et al.38 developed BERT-Kcr, a predictor for identifying protein lysine crotonylation sites, using the BERT model. Similarly, Liu et al.39 created BERT-Kgly, a predictor for lysine glycosylation sites, by combining features extracted from a pre-trained protein language BERT model with a deep learning model.

BERT uses a masked language model (MLM) to train a bidirectional Transformer that produces deep bidirectional language representations. Its encoder layers use multi-head self-attention to process the left and right contexts simultaneously, allowing all words in a sentence to be processed in parallel. The attention mechanism is based on three components: the Query (the target word, or the word to be generated), the Key (the key vector representation of each word in the context), and the Value (the value representation of each word in the context). The multi-head self-attention mechanism first applies \(h\) linear transformations with different parameters to project the Query, Key and Value, and then feeds each of the \(h\) transformed sets of vectors into a self-attention layer. The attention results obtained from the different heads are then concatenated to form a representation that combines information from the different subspaces. The main calculation procedure is shown in the following equations:

$$\begin{array}{c}MultiHead\left({\text{Q}},{\text{K}},{\text{V}}\right) = Concat\left({{\text{head}}}_{1},\cdots ,{{\text{head}}}_{{\text{h}}}\right){{\text{W}}}^{{\text{O}}}\end{array}$$
(4)
$$\begin{array}{c}{{\text{head}}}_{i} = Attention\left({{\text{QW}}}_{{\text{i}}}^{{\text{Q}}},{{\text{KW}}}_{{\text{i}}}^{{\text{K}}},{{\text{VW}}}_{{\text{i}}}^{{\text{V}}}\right)\end{array}$$
(5)
$$\begin{array}{c}Attention\left({\text{Q}},{\text{K}},{\text{V}}\right) = softmax\left(\frac{{{\text{QK}}}^{{\text{T}}}}{\sqrt{{{\text{d}}}_{{\text{k}}}}}\right)V\end{array}$$
(6)

where Q, K and V represent the Query, Key and Value vectors, \({d}_{k}\) is the dimension of the Key, and \({W}_{i}^{Q}\), \({W}_{i}^{K}\), \({W}_{i}^{V}\) and \({W}^{O}\) are parameter matrices40. When a sentence is fed into the BERT model, the input vector for each word consists of three components: a token embedding, a segment embedding and a position embedding. The token embedding represents each word in the vocabulary, encoded according to the chosen tokenization scheme; the segment embedding indicates whether the word belongs to the first or the second sentence; and the position embedding, which represents the position of a word in a sentence, is computed as follows:

$$\begin{array}{c}{{\text{PE}}}_{\left(pos,2i\right)} = sin\left(\frac{pos}{{10000}^{\frac{2i}{{d}_{model}}}}\right)\end{array}$$
(7)
$$\begin{array}{c}{{\text{PE}}}_{\left(pos, 2i+1\right)} = cos\left(\frac{pos}{{10000}^{\frac{2i}{{d}_{model}}}}\right)\end{array}$$
(8)

where \(pos\) is the position, \(i\) is the component index of the embedding vector, and \({d}_{model}\) is the dimension of the embedding37. Context-dependent features can then be obtained from the various encoder layers of the model.
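For illustration, a NumPy sketch of the scaled dot-product attention of Eq. (6) and the sinusoidal positional encoding of Eqs. (7) and (8); the toy dimensions are arbitrary and \(d_{model}\) is assumed to be even:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V, as in Eq. (6)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional embeddings, Eqs. (7) and (8); d_model must be even."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even components, Eq. (7)
    pe[:, 1::2] = np.cos(angles)   # odd components, Eq. (8)
    return pe

# Toy example: 5 tokens, model dimension 8, self-attention on the positional encodings
pe = positional_encoding(5, 8)
out = scaled_dot_product_attention(pe, pe, pe)
```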

Machine learning methods

The XGBoost (eXtreme Gradient Boosting)41 classifier was used in this study to build the model. XGBoost is a modified version of the gradient boosting decision tree (GBDT)42 that combines multiple regression trees to produce predictions as close to the true values as possible while retaining strong generalisation power. The model has two main advantages: it is regularised to prevent overfitting, and it supports parallelisation, which greatly speeds up training.
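As a sketch of how such a classifier might be trained with the xgboost scikit-learn interface; the feature matrix, labels and hyperparameter values below are placeholders rather than the settings used in this study:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Placeholder data: rows are peptides, columns the concatenated feature vector
# (e.g. AAC + AAP + AAT + BERT 'CLS' embedding); the sizes are arbitrary.
X = np.random.rand(1000, 790)
y = np.random.randint(0, 2, size=1000)   # 1 = epitope, 0 = non-epitope

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clf = XGBClassifier(
    n_estimators=500,      # number of boosted regression trees (illustrative)
    max_depth=6,
    learning_rate=0.05,
    reg_lambda=1.0,        # L2 regularisation term that helps prevent overfitting
    n_jobs=-1,             # tree construction is parallelised across cores
)
clf.fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]
print("held-out AUROC:", roc_auc_score(y_te, probs))
```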

Evaluation metrics

This study employed five commonly used evaluation metrics to assess the model: accuracy (ACC), precision (Pre), sensitivity (Sn), F1 score (F1) and Matthews correlation coefficient (MCC). Additionally, we plotted the receiver operating characteristic (ROC) curve and calculated the area under it (AUROC).

$$\begin{array}{c}Sn=\frac{TP}{TP+FN}\end{array}$$
(9)
$$\begin{array}{c}Pre=\frac{TP}{TP+FP}\end{array}$$
(10)
$$\begin{array}{c}ACC=\frac{TP+TN}{TP+TN+FP+FN}\end{array}$$
(11)
$$\begin{array}{c}F1=\frac{2TP}{2TP+FP+FN}\end{array}$$
(12)
$$\begin{array}{c}MCC=\frac{TP\times TN-FP\times FN}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)\left(TN+FP\right)\left(TN+FN\right)}}\end{array}$$
(13)

where TP, TN, FP and FN signify the numbers of true positives, true negatives, false positives and false negatives, respectively.
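A minimal sketch computing these metrics directly from the confusion-matrix counts of Eqs. (9)–(13); the counts in the example are arbitrary:

```python
import math

def metrics(tp, tn, fp, fn):
    """Sn, Pre, ACC, F1 and MCC from confusion-matrix counts (Eqs. 9-13)."""
    sn  = tp / (tp + fn)
    pre = tp / (tp + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1  = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Sn": sn, "Pre": pre, "ACC": acc, "F1": f1, "MCC": mcc}

print(metrics(tp=90, tn=80, fp=20, fn=10))  # toy counts
```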

Results

Sequence discrepancy between positive and negative samples in the benchmark dataset

We developed a machine learning method based on peptide sequence information to distinguish between BCEs and non-BCEs, under the assumption that positive and negative samples exhibit different sequence patterns. We visualized the overall pattern differences with Two Sample Logos43, which show the distribution and preference of epitope and non-epitope residues. Figure 2 illustrates that residue P is significantly enriched at several positions in the positive samples, and residues T, G, and D are enriched to varying degrees. In contrast, residue L is significantly depleted at several positions, along with residues I and F. The overall enrichment or depletion rate at specific sequence positions was 9.4%. Figure 3 shows an enrichment or depletion ratio of 6.9% at specific sequence positions in the benchmark dataset LBtope, where residues P and Q are more strongly enriched in positive samples, while residues L, V, and A are more strongly depleted. Note that the sum of the enrichment and depletion rates of the different residues at each position reflects the degree of difference between the positive and negative samples at that position.

Figure 2

Overall sequence pattern discrepancy between positive and negative samples illustrated by Two Sample Logo43 in the benchmark dataset BCPreds.

Figure 3

Overall sequence pattern discrepancy between positive and negative samples illustrated by Two Sample Logo43 in the benchmark dataset LBtope.

Model based on embeddings of pretrained BERT models

In this work, we directly used the BERT pre-trained model constructed by Zhang et al.44 on 556,603 protein sequences to encode peptide chain sequences and extract token embeddings as features. To test the effectiveness of this feature encoding approach, we extracted the embedding of the ‘CLS’ token and the average embedding of all amino acids in the peptide chain as features from the epitope and non-epitope sequences in the BCPreds, iBCE-EL_training and LBtope datasets, respectively. The ‘CLS’ token is always the first token of each classification sequence, as described by Devlin et al.45. We then built a model to predict linear BCEs using the XGBoost classifier and trained it with the different features. Table 2 shows the five-fold cross-validation results of the different models. The model trained with the embedding of the ‘CLS’ token consistently outperformed the model trained with the average embedding of all amino acids, regardless of whether it was trained on BCPreds, iBCE-EL_training or LBtope. Therefore, in subsequent experiments we used only the embedding of the ‘CLS’ token from the protein BERT pre-trained model as a feature. Among the three datasets, the results on the BCPreds dataset were the best: the model trained with the ‘CLS’ token embedding achieved a five-fold cross-validation AUROC of 0.693 and an ACC of 0.629, while the model trained with the average embedding of all amino acids achieved an AUROC of 0.671 and an ACC of 0.619.

Table 2 Cross-validation results based on two kinds of BERT embedding.
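For illustration, a hedged sketch of extracting a ‘CLS’ embedding with the Hugging Face transformers API; the checkpoint Rostlab/prot_bert is only a stand-in for the protein BERT model of Zhang et al.44 actually used in this study, and protein BERT tokenizers typically expect space-separated residues:

```python
import torch
from transformers import BertModel, BertTokenizer

# Stand-in checkpoint: not necessarily the model of Zhang et al. used in this work.
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

def cls_embedding(peptide: str) -> torch.Tensor:
    """Return the embedding of the [CLS] token for one peptide sequence."""
    inputs = tokenizer(" ".join(peptide), return_tensors="pt")  # space-separated residues
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, hidden_dim)
    return hidden[0, 0]  # position 0 is the [CLS] token

emb = cls_embedding("ACDEFGHIKLMNPQRSTVWY")  # example 20-mer
print(emb.shape)
```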

Evaluate model performance based on different datasets

The AAC, AAP and AAT features are combined with the embedding of the ‘CLS’ token from the protein BERT pre-trained model and input into XGBoost for training and hyperparameter optimization. Note that although these datasets contain variable-length linear BCEs, the four feature representations map each peptide to a fixed-dimensional feature vector, so variable sequence length does not affect model training. We trained three BCE prediction models using the BCPreds, LBtope, and iBCE-EL_training datasets, labelled LBCE-BERT(BCPreds), LBCE-BERT(LBtope) and LBCE-BERT(iBCE-EL). To evaluate model performance, we performed five-fold cross-validation and independent-dataset validation on seven datasets (BCPreds, Chen, ABCPred, Blind387, LBtope, iBCE-EL_training, and iBCE-EL_independent), with the evaluation metrics calculated as described in section “Evaluation metrics”. The following subsections present the five-fold cross-validation results and analysis for the three models.
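Before the per-model results, a minimal sketch of the feature concatenation and five-fold cross-validation pipeline, assuming the helper functions sketched above (aac_features, kmer_scale_feature, cls_embedding); the hyperparameter values are illustrative:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

def encode_peptide(peptide, aap, aat):
    """Concatenate AAC, AAP, AAT and the BERT 'CLS' embedding into one
    fixed-length vector, so peptides of any length yield the same dimension."""
    return np.concatenate([
        aac_features(peptide),                      # 20 dims (sketch above)
        [kmer_scale_feature(peptide, aap, k=2)],    # AAP score, 1 dim
        [kmer_scale_feature(peptide, aat, k=3)],    # AAT score, 1 dim
        cls_embedding(peptide).numpy(),             # BERT hidden size
    ])

def five_fold_auroc(X, y):
    """Mean AUROC over stratified five-fold cross-validation."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    clf = XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.05)
    return cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
```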

To assess how the models trained on BCPreds, LBtope, and iBCE-EL_training perform on the other datasets, we plotted AUROC curves for BCPreds, Chen, ABCPred, Blind387, LBtope, iBCE-EL_training, and iBCE-EL_independent under each of the three models; these are shown in Figs. S1–S7, respectively.

Evaluation of model LBCE-BERT(BCPreds)

Table 3 shows that our proposed method achieves an AUROC of 0.924 in five-fold cross-validation on the BCPreds dataset, a 3% improvement over the EpitopeVec model trained on the same dataset. We then tested this model on other datasets and found that it performed well on both Chen and ABCPred. As shown in Table 4, our method achieves an AUROC of 0.990 on the Chen dataset, compared with 0.959 for EpitopeVec (the previously best-performing method), and an ACC of 0.941, a 21.6% improvement over the original method. On the Chen dataset, the models’ prediction accuracies ranged from 0.494 to 0.941; the lowest-performing methods were iBCE-EL and LBtope, with ACC values of 0.494 and 0.533, respectively. On the ABCPred dataset (Table 5), the accuracy of the original method (ABCPred) is only 0.6593, lower than that of AAP (0.7314), BCPreds (0.7457), and EpitopeVec (0.856). LBCE-BERT(BCPreds) also has the highest AUROC (0.934) among all methods, indicating that our model’s performance is stable and efficient. The Blind387 dataset, which consists of epitopes derived from viruses and was compiled from the published literature, was also tested; the results are shown in Table 6. Our model’s performance on this dataset was moderate, with an ACC of 0.705, 4.1% higher than the original method used to build the dataset but slightly lower than the best method, EpitopeVec; similarly, its AUROC of 0.757 is only 2.2% lower than that of the best method. Unlike BCPreds, Chen and ABCPred, the LBtope and iBCE-EL datasets were obtained from the IEDB database. According to the results in Table 7, LBCE-BERT(BCPreds), BCPreds, iBCE-EL, and EpitopeVec(BCPreds) have similar accuracies on the LBtope dataset. However, on the training and independent test sets of iBCE-EL (Table 8), EpitopeVec(BCPreds) slightly outperformed LBCE-BERT(BCPreds).

Table 3 Multi-method prediction results on the BCPreds dataset.
Table 4 Multi-method prediction results on the Chen dataset.
Table 5 Multi-method prediction results on the ABCPred dataset.
Table 6 Multi-method prediction results on the Blind387 dataset.
Table 7 Multi-method prediction results on the LBtope dataset.
Table 8 Multi-method prediction results on the iBCE-EL dataset.

Evaluation of model LBCE-BERT(LBtope)

Table 7 shows the results of the five-fold cross-validation on the LBtope dataset. The model trained on this dataset performs slightly worse. On the BCPreds dataset (Table 3), the AUROC values for LBCE-BERT(LBtope), EpitopeVec(LBtope), and ABCPred were 0.641, 0.645, and 0.643, respectively, while the ACC values for LBtope and iBCE-EL were 0.5157 and 0.4871, respectively. On the Chen dataset (Table 4), our model is the second-best performing method after the originally proposed method; the lowest-performing method is iBCE-EL, with an accuracy of only 0.494. LBCE-BERT(LBtope) did not perform well on either ABCPred or Blind387, with AUROC values of 0.617 and 0.656 on these two datasets, respectively. On the LBtope dataset itself, LBCE-BERT(LBtope) achieved an AUROC of 0.733 and an ACC of 0.671, while the other methods achieved ACC values ranging from 0.522 to 0.690. Furthermore, as shown in Table 8, our model’s accuracy was slightly higher than that of EpitopeVec(LBtope) but lower than that of the original method on both the training and independent test sets of iBCE-EL; however, its AUROC was the lowest of the three models.

Evaluation of model LBCE-BERT(iBCE-EL)

In five-fold cross-validation on the iBCE-EL training set (Table 8), our model achieved an AUROC of 0.820, 2.7% higher than the 0.793 of the previously best model, EpitopeVec, and well above the 0.782 of the original method, making LBCE-BERT(iBCE-EL) the best-performing model on this dataset. On the iBCE-EL independent test set, the original method (iBCE-EL) achieved an AUROC of 0.786 and EpitopeVec(iBCE-EL) achieved 0.785, while LBCE-BERT(iBCE-EL) achieved an AUROC of 0.828 and an ACC of 0.757. These results suggest that LBCE-BERT(iBCE-EL) has better generalisation ability. This model was also evaluated on the other datasets. The results of LBCE-BERT(iBCE-EL) on BCPreds (Table 3), Chen (Table 4), ABCPred (Table 5), Blind387 (Table 6) and LBtope (Table 7) showed little variation, with only a 0.017 difference in ACC values and a 0.027 difference in AUROC values. LBCE-BERT(iBCE-EL) achieved an ACC of 0.603 on the BCPreds dataset, 0.586 on Chen and 0.586 on ABCPred. On Blind387, it achieved an ACC of 0.589, which is better only than the iBCE-EL method’s ACC of 0.434; other methods, such as ABCPred (0.6641), AAP (0.6460), BCPreds (0.6589) and EpitopeVec (0.693), outperformed LBCE-BERT(iBCE-EL). On the LBtope dataset, our model achieved an ACC of 0.595, higher than BCPreds (0.5256), iBCE-EL (0.522) and EpitopeVec (0.573).

Discussion and conclusions

In this study, we developed LBCE-BERT, a machine learning method for predicting linear BCEs based on amino acid sequence features and embeddings from the protein language model BERT, and achieved good performance. Accurate prediction of linear BCEs provides valuable insights for applications in biotechnology, such as the treatment and prevention of infectious diseases. We trained three models on the BCPreds, LBtope and iBCE-EL datasets, respectively, and compared them with a range of methods, including the original methods that proposed these datasets and the current best-performing methods for predicting BCEs. The experimental results showed that the LBCE-BERT(BCPreds) and LBCE-BERT(iBCE-EL) models outperformed the other models, whereas the LBCE-BERT(LBtope) model performed slightly worse than EpitopeVec. The sequence analysis suggests that the LBtope dataset contains sequences that are harder to classify: its overall enrichment or depletion ratio at specific sequence positions is lower than that of the BCPreds dataset. In contrast, LBCE-BERT(iBCE-EL) achieves better performance on the iBCE-EL dataset, which was obtained from the same IEDB database as LBtope. This may be because the original LBtope sequences were altered when residues were added or deleted to fix the sequence length, which can result in the loss of sequence information about the epitope.

Furthermore, our model exhibited strong performance when validated on other datasets, indicating good generalization capability. However, when validating on datasets derived from the Bcipep database, we observed a significant decrease in the predictive performance of models that had been trained primarily on datasets derived from the IEDB database; likewise, when validating on datasets obtained from the IEDB database, models trained on Bcipep-derived datasets showed the same drop. These findings indicate that models trained on datasets from the same database may share similar biases. Therefore, it is important to diversify the training sets used by computational methods.

In conclusion, this study presents the LBCE-BERT method, which effectively predicts linear BCEs, and demonstrates its robust performance across multiple datasets. These findings contribute to the field of BCE prediction and offer valuable insights for future research in this area.