
On the prediction of DNA-binding proteins only from primary sequences: A deep learning approach

  • Yu-Hui Qu,

    Roles Methodology, Software

Affiliations: School of Computer Science and Technology, Tianjin University, Nankai, Tianjin 300072, China; Tianjin Key Laboratory of Cognitive Computing and Application, Nankai, Tianjin 300072, China

  • Hua Yu,

    Roles Writing – original draft, Writing – review & editing

Affiliations: School of Computer Science and Technology, Tianjin University, Nankai, Tianjin 300072, China; Tianjin Key Laboratory of Cognitive Computing and Application, Nankai, Tianjin 300072, China

  • Xiu-Jun Gong ,

    Roles Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing

    gongxj@tju.edu.cn

Affiliations: School of Computer Science and Technology, Tianjin University, Nankai, Tianjin 300072, China; Tianjin Key Laboratory of Cognitive Computing and Application, Nankai, Tianjin 300072, China

  • Jia-Hui Xu,

    Roles Data curation, Writing – review & editing

Affiliation: Beijing KEDONG Electric Power Control System Co., Ltd., Qinghe, Beijing 100192, China

  • Hong-Shun Lee

    Roles Data curation, Software

Affiliations: School of Computer Science and Technology, Tianjin University, Nankai, Tianjin 300072, China; Tianjin Key Laboratory of Cognitive Computing and Application, Nankai, Tianjin 300072, China

Abstract

DNA-binding proteins play pivotal roles in alternative splicing, RNA editing, methylation and many other biological functions in both eukaryotic and prokaryotic proteomes. Predicting the functions of these proteins from primary amino acid sequences alone is becoming one of the major challenges in the functional annotation of genomes. Traditional prediction methods often devote themselves to extracting physicochemical features from sequences but ignore motif information and the positional relationships between motifs. Meanwhile, small data volumes and large amounts of noise in training data lead to low accuracy and reliability of predictions. In this paper, we propose a deep learning based method to identify DNA-binding proteins from primary sequences alone. It utilizes two stages of convolutional neural networks to detect the functional domains of protein sequences, a long short-term memory neural network to identify their long-term dependencies, and a binary cross-entropy loss to evaluate the quality of the networks. When the proposed method is tested on a realistic DNA-binding protein dataset, it achieves a prediction accuracy of 94.2% with an AUC of 0.961. Compared with LibSVM on the Arabidopsis and Yeast datasets via independent tests, the accuracy rises by 9% and 4%, respectively. Comparative experiments using different feature extraction methods show that our model achieves accuracy similar to the best of the others, while its sensitivity, specificity and AUC increase by 27.83%, 1.31% and 16.21%, respectively. These results suggest that our method is a promising tool for identifying DNA-binding proteins.

Introduction

DNA binding is one vital function of proteins, playing pivotal roles in alternative splicing, RNA editing, methylation and many other biological processes in both eukaryotic and prokaryotic proteomes [1]. Currently, both computational and experimental techniques have been developed to identify DNA-binding proteins. Because experimental identification is time-consuming and expensive, computational approaches are highly desired to distinguish DNA-binding proteins among the explosively growing number of newly discovered proteins. So far, numerous structure based or sequence based predictors for determining DNA-binding proteins have been proposed [2–4]. Structure based predictions normally attain high accuracy thanks to the availability of many physicochemical characteristics, but they can only be applied to the small number of proteins with high-resolution three-dimensional structures. Thus, with the availability of huge volumes of protein sequence data, uncovering DNA-binding proteins from their primary sequences alone is becoming an urgent task in the functional annotation of genomes.

In the past decades, a series of computational methods for identifying DNA-binding proteins using only primary sequences have been proposed. Among these methods, building a meaningful feature set and choosing an appropriate machine learning algorithm are the two crucial steps for successful predictions [5]. Cai et al. first developed an SVM-based method, SVM-Prot, whose feature set came from three protein descriptors, composition (C), transition (T) and distribution (D), extracting seven physicochemical characteristics of amino acids [2]. Kumar et al. trained an SVM model using amino acid composition and evolutionary information in the form of PSSM profiles [1]. iDNA-Prot used a random forest as the prediction engine, incorporating features extracted from protein sequences via a "grey model" into the general form of pseudo amino acid composition [3]. Zou et al. trained an SVM classifier whose feature set came from three different feature transformation methods applied to four kinds of protein properties [4]. Lou et al. proposed a prediction method for DNA-binding proteins that ranks features using random forest and performs wrapper-based feature selection with a forward best-first search strategy [6]. Ma et al. used a random forest classifier with a hybrid feature set incorporating the binding propensity of DNA-binding residues [7]. Professor Liu's group developed several novel tools for predicting DNA-binding proteins, such as iDNA-Prot|dis, which incorporates amino acid distance-pairs and reduced alphabet profiles into the general pseudo amino acid composition [8]; PseDNA-Pro, which combines PseAAC and physicochemical distance transformations [9]; iDNAPro-PseAAC, which combines pseudo amino acid composition and profile-based protein representation [10]; and iDNA-KACC, which combines auto-cross covariance transformation and ensemble learning [11]. Zhou et al. encoded protein sequences at multiple scales using seven qualitative and quantitative properties of amino acids to predict protein interactions [5]. There are also several general-purpose protein feature extraction tools, such as Pse-in-One [12] and Pse-Analysis [13], which generate feature vectors according to a user-defined schema and are therefore more flexible.

Deep learning is now one of the most active fields in machine learning and has achieved great success in computer vision [14], speech recognition [15] and natural language processing [16]. It composes multiple linear and non-linear transformations to model high-level abstractions, using a deep graph with multiple processing layers. Convolutional neural networks (CNN) and long short-term memory networks (LSTM) are two typical deep learning architectures. The computational biology community is applying deep learning to its own problems [17], ranging from DNA and RNA binding specificity prediction [18–20] to protein secondary structure [21], folding [22] and contact map [23] recognition. Most of these methods make predictions using not only sequences but also additional information, such as transcription data [18] and evolutionary profiles [21]; few of them use sequence information alone. Furthermore, Asgari et al. derived a continuous distributed representation of biological sequences to accelerate such developments [24].

Since deep learning techniques have been successful in other disciplines, we aim to investigate whether deep learning networks can achieve notable improvements in identifying DNA-binding proteins using sequence information only. In this work, we propose a deep learning based method to predict DNA-binding proteins from primary sequences. The model utilizes two stages of convolutional neural networks to detect the functional domains of protein sequences, a long short-term memory neural network to identify their long-term dependencies, and a binary cross-entropy loss to evaluate the quality of the networks. It requires far less human intervention in the feature selection procedure than traditional machine learning methods, since all features are learned automatically: convolution filters detect the functional domains of a sequence, and the positional information among domains encoded in the resulting feature maps is captured by the LSTM. Extensive experiments show remarkable prediction power with high generality and reliability.

Materials and methods

Data sets

The raw protein sequences are extracted from the Swiss-Prot dataset, a manually annotated and reviewed subset of UniProt. It is a comprehensive, high-quality and freely accessible database of protein sequences and functional information. We collect 551,193 proteins as the raw dataset from release 2016.5 of Swiss-Prot.

To obtain DNA-binding proteins, we extract sequences from the raw dataset by searching for the keyword "DNA-Binding", then remove sequences shorter than 40 or longer than 1,000 amino acids. Finally, 42,257 protein sequences are selected as positive samples. We randomly select 42,310 non-DNA-binding proteins as negative samples from the rest of the dataset using the query condition "molecule function and length [40 to 1,000]". For both positive and negative samples, 80% are randomly selected as the training set and the rest as the testing set. To validate the generality of our model, two additional testing sets (Yeast and Arabidopsis) from the literature [25] are also used. See Table 1 for details.

In reality, the number of non-DNA-binding proteins is far greater than that of DNA-binding proteins, so the majority of DNA-binding protein datasets are imbalanced. We therefore simulate a realistic dataset by reusing the positive samples of the equal set and constructing negative samples, with the query condition "molecule function and length [40 to 1,000]", from the part of the dataset that does not include those positive samples (Table 2). The validation datasets were also obtained using the method in the literature [25], with the additional condition "(sequence length ≤ 1000)". Finally, 104 DNA-binding and 480 non-DNA-binding sequences were obtained.

To further verify the generalization ability of the model, datasets for multiple species, including human, mouse and rice, are constructed using the method above. See Table 3 for details.

For traditional sequence-based classification methods, redundancy among sequences in the training dataset often leads to over-fitting of the prediction model. Meanwhile, sequences in the Yeast and Arabidopsis testing sets may be included in the training dataset or share high similarity with some training sequences; such overlaps might inflate the apparent testing performance. Thus, we construct low-redundancy versions of both the equal and realistic datasets to validate whether our method works in such situations. We first remove the sequences that appear in the Yeast and Arabidopsis datasets, then apply the CD-HIT tool at its lowest threshold value, 0.7, to remove sequence redundancy. See Table 4 for details of the datasets.

Table 4. Low-redundancy versions of the equal and realistic datasets.

https://doi.org/10.1371/journal.pone.0188129.t004

Methods

Just like natural language in the real world, letters working together in different combinations construct words, and words combining with each other in different ways form phrases. Processing the words in a document can reveal the document's topic and meaningful content. In this work, a protein sequence is analogous to a document, an amino acid to a word, and a motif to a phrase. Mining the relationships among them can yield higher-level information about the behavioral properties of the physical entities the sequences correspond to.

Deep learning model structure.

The proposed deep learning model consists of four layered components: an encoding layer, an embedding layer, a CNN layer and an LSTM layer, as shown in Fig 1. The encoding layer maps a sequence to a fixed-length digital vector. The embedding layer translates it into a continuous vector. Similar to the word2vec model, transforming into this continuous space allows us to use continuous metrics of similarity to evaluate the semantic quality of individual amino acids. The CNN layer consists of two convolutional layers, each followed by a max pooling operation. The CNN enforces a local connectivity pattern between neurons of adjacent layers to exploit spatially local structure. Specifically, the CNN layer captures non-linear features of protein sequences, e.g. motifs, and enhances their high-level associations with DNA binding functions. The long short-term memory (LSTM) network, capable of learning order dependence in sequence prediction problems, is used to learn long-term dependencies between motifs.

For a given protein sequence S, after the four-layer processing, an affinity score f(S) for being a DNA-binding protein is calculated by Eq 1.

(1)

After that, a sigmoid activation is applied to predict the function label of a protein sequence, and a binary cross-entropy loss is applied to assess the quality of the networks. The whole process is trained via back propagation. Fig 1 shows the details of the model. To illustrate how the proposed method works, an example sequence S = MSFMVPT is used to show the output of each processing stage.
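To make the architecture concrete, below is a minimal Keras sketch of the four-layer model described above (using the Keras 2 API). The vocabulary size, embedding dimension, filter counts and LSTM width here are illustrative assumptions, not the exact hyper-parameters of Table 6.

# A minimal sketch of the four-layer architecture; all hyper-parameters
# below are illustrative assumptions, not the values used in the paper.
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

max_length = 1000  # maximum sequence length after front padding
vocab_size = 21    # 20 amino acids plus the padding token "X"

model = Sequential()
# Encoding/embedding: integer-encoded residues -> dense 8-dim vectors
model.add(Embedding(input_dim=vocab_size, output_dim=8, input_length=max_length))
# First convolution + max pooling stage: detect local motifs
model.add(Conv1D(filters=64, kernel_size=2, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
# Second convolution + max pooling stage: higher-level motif combinations
model.add(Conv1D(filters=64, kernel_size=2, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
# LSTM stage: long-term dependencies between the detected motifs
model.add(LSTM(32))
# Sigmoid output: affinity score of being a DNA-binding protein
model.add(Dense(1, activation='sigmoid'))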

Protein sequence encoding.

Feature encoding is tedious but critical work when building a statistical machine learning model for most protein sequence classification tasks. Various approaches, such as homology-based methods, n-gram methods and physicochemical property based extraction methods, have been proposed. Although these methods work well in many scenarios, the intensive human involvement they require limits their practical usefulness. One of the greatest successes of the emerging deep learning technology is its capability to learn features automatically. To verify this capability, we simply assign each amino acid a natural number (see Table 5). It should be noted that the order of these assignments has no effect on the final performance.

The encoding stage simply generates a fixed-length digital vector from a protein sequence. If its length is less than "max_length", a special token "X" is filled in at the front. The example sequence becomes S1, shown in Eq 2, after encoding.

(2)
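The encoding step can be sketched in a few lines of Python. The alphabetical amino-acid-to-integer mapping below is a hypothetical stand-in for Table 5 (the text notes the ordering does not matter), with 0 reserved for the padding token "X".

# Integer-encode a protein sequence with front padding (a sketch; the
# alphabetical mapping is a stand-in for Table 5).
AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 1..20
AA_TO_INT['X'] = 0  # padding token

def encode(seq, max_length=10):
    padded = 'X' * (max_length - len(seq)) + seq  # fill "X" at the front
    return [AA_TO_INT[aa] for aa in padded]

print(encode('MSFMVPT'))  # [0, 0, 0, 11, 16, 5, 11, 18, 13, 17]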

Embedding stage.

The vector space model is used to represent words in natural language processing. Embedding is a mapping process in which each word of the discrete vocabulary is embedded into a continuous vector space, so that semantically similar words are mapped to similar regions. This is done by simply multiplying the one-hot vector from the left with a weight matrix W ∈ ℝ^{d × |V|}, where |V| is the number of unique symbols in the vocabulary, as in Eq 3.

e_i = W x_i  (3)

where x_i ∈ {0, 1}^{|V|} is the one-hot vector of the i-th symbol.

After the embedding layer, the input amino acid sequence becomes a sequence of dense real-valued vectors (e_1, e_2, …, e_t). Existing deep learning toolkits such as Keras provide an embedding layer that transforms an (n_batches, sentence_length) matrix of integers, representing words of the vocabulary, into an (n_batches, sentence_length, n_embedding_dims) matrix. Assuming the output dimension is 8, the embedding stage maps each number in S1 to a fixed-length vector, and S1 becomes an 8 × 8 matrix (Eq 4). In this matrix, Methionine is represented by [0.4, −0.4, 0.5, 0.6, 0.2, −0.1, −0.3, 0.2] and Threonine by [0.5, −0.8, 0.7, 0.4, 0.3, −0.5, −0.7, 0.8].

(4)
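In Keras this lookup can be sketched as follows. The integer codes reuse the hypothetical mapping from the encoding sketch above, and the weights are randomly initialized, whereas the vectors of Eq 4 come from the trained model.

# Sketch: the embedding layer turns the 8 integers of S1 into an 8 x 8 matrix.
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(input_dim=21, output_dim=8, input_length=8))
model.compile(optimizer='rmsprop', loss='mse')  # build the (random) weights

s1 = np.array([[0, 11, 16, 5, 11, 18, 13, 17]])  # batch of one encoded sequence
print(model.predict(s1).shape)  # (1, 8, 8): one 8 x 8 embedding matrix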

Convolution stage.

Convolutional neural networks are widely used in image processing to discover local features in an image. The encoded amino acid sequence is converted into a fixed-size two-dimensional matrix as it passes through the embedding layer, and can therefore be processed by convolutional neural networks like an image. Let X with dimension L_in × n be the input of a 1D convolutional layer. We use N filters of size k × n to perform a sliding window operation across all positions, which produces an output feature map of size N × (L_in − k + 1). For the example sequence, the convolution stage uses multiple two-dimensional filters W ∈ ℝ^{2 × 8} to scan the matrix, as in Eq 5:

x_j^l = f(W_j ⊗ x^{l−1} + b_j)  (5)

where x_j^l is the j-th feature map of layer l, W_j is the j-th filter, ⊗ is the convolution operator, b_j is the bias, and the activation function f is the ReLU, which increases the nonlinear properties of the network, as shown in Eq 6.

f(x) = max(0, x)  (6)

The structure of the convolutional neural network is shown in Fig 2. Each filter is used to scan for one feature in the sequence. To understand the convolutional neural network more intuitively, we take a 2 × 8 filter (Eq 7) from the convolution layer of the best-performing trained model.

Fig 2. The structure of the convolutional neural network.

The model uses 2 filters to obtain 2 feature maps, then applies a max-over-time pooling operation over each feature map and takes the maximum value as the feature corresponding to the filter.

https://doi.org/10.1371/journal.pone.0188129.g002

(7)

Applying W to S2 yields an 8-dimensional vector r, as shown below.

Then a max-over-time pooling operation with pooling_length = 2 is applied, and r becomes S3 (Eq 8), which is known as the feature map detected by the filter W.

(8)
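The sliding-window convolution (Eq 5), the ReLU (Eq 6) and the max-over-time pooling (Eq 8) can be illustrated in plain NumPy; the input and the filter below are random stand-ins for the embedded sequence and for the trained filter of Eq 7.

# Sketch of one 1D convolution filter followed by max-over-time pooling.
import numpy as np

def conv1d_relu(x, w):
    # Slide a k x n filter over an L x n input and apply ReLU (Eq 6).
    k = w.shape[0]
    out = np.array([np.sum(x[i:i + k] * w) for i in range(len(x) - k + 1)])
    return np.maximum(out, 0.0)  # f(x) = max(0, x)

def max_pool(r, pool_length=2):
    # Max-over-time pooling with non-overlapping windows.
    return np.array([r[i:i + pool_length].max()
                     for i in range(0, len(r) - pool_length + 1, pool_length)])

x = np.random.randn(9, 8)  # embedded sequence (L_in = 9, n = 8)
w = np.random.randn(2, 8)  # one 2 x 8 filter, analogous to Eq 7
r = conv1d_relu(x, w)      # feature map of length L_in - k + 1 = 8
s3 = max_pool(r, 2)        # pooled feature map of length 4
print(r.shape, s3.shape)   # (8,) (4,)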

LSTM stage.

Although traditional RNNs have achieved significant results in speech recognition and text generation, the problem of vanishing and exploding gradients makes it difficult for them to learn long-term dynamics. The LSTM is a special recurrent neural network architecture that provides a solution by incorporating memory units, allowing the network to learn when to forget previous hidden states and when to update hidden states given new information. It uses purpose-built memory cells to store information. The classical structure of an LSTM cell [26] is shown in Fig 3.

i_t = σ(W_xi x_t + W_hi h_{t−1} + W_ci c_{t−1})  (9)

f_t = σ(W_xf x_t + W_hf h_{t−1} + W_cf c_{t−1})  (10)

c_t = f_t c_{t−1} + i_t tanh(W_xc x_t + W_hc h_{t−1})  (11)

o_t = σ(W_xo x_t + W_ho h_{t−1} + W_co c_t)  (12)

h_t = o_t tanh(c_t)  (13)

The components of an LSTM cell are defined by the equations above, where σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the hidden vector h. The weight matrix subscripts have the obvious meaning; for example, W_hi is the hidden-input gate matrix, W_xo the input-output gate matrix, etc. The weight matrices from the cell to the gate vectors (e.g. W_ci) are diagonal, so element m of each gate vector only receives input from element m of the cell vector. The bias terms (which are added to i, f, c and o) have been omitted for clarity.

In our model, the features from the previous stage, i.e. the output S4 of the second convolution layer, are passed to the LSTM network as input, and the LSTM generates a fixed-length feature representation.
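A direct NumPy transcription of Eqs 9 to 13 for a single time step might look as follows; the weights are random and, as in the equations, the bias terms are omitted.

# Sketch: one LSTM time step following Eqs 9-13 (biases omitted).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    # W holds the weight matrices; the cell-to-gate weights W['c*'] are
    # diagonal and therefore stored as vectors applied elementwise.
    i = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] * c_prev)  # Eq 9
    f = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] * c_prev)  # Eq 10
    c = f * c_prev + i * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev)    # Eq 11
    o = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] * c)       # Eq 12
    h = o * np.tanh(c)                                                # Eq 13
    return h, c

n_in, n_hid = 8, 4
W = {k: np.random.randn(n_hid, n_in) for k in ('xi', 'xf', 'xc', 'xo')}
W.update({k: np.random.randn(n_hid, n_hid) for k in ('hi', 'hf', 'hc', 'ho')})
W.update({k: np.random.randn(n_hid) for k in ('ci', 'cf', 'co')})
h, c = lstm_step(np.random.randn(n_in), np.zeros(n_hid), np.zeros(n_hid), W)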

Activate and loss functions.

In general, a sigmoid function demonstrates well-behaved mathematical properties: it is real-valued, differentiable and bounded, with a non-negative first derivative. We therefore use it as the activation function of the output of the network, see Eq 14.

σ(x) = 1 / (1 + e^{−x})  (14)

A loss function measures how well a machine learning model fits the empirical data. In this study, a binary cross-entropy [27] is applied to assess the prediction performance, see Eq 15:

L(t, o) = −[t log(o) + (1 − t) log(1 − o)]  (15)

where t is the target and o is the output.
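Eq 15 can be written directly as a short function; Keras' built-in 'binary_crossentropy' additionally averages over the batch and clips o away from 0 and 1, as sketched here.

# Binary cross-entropy of Eq 15 for a single prediction (a sketch).
import numpy as np

def binary_cross_entropy(t, o, eps=1e-7):
    o = np.clip(o, eps, 1.0 - eps)  # avoid log(0)
    return -(t * np.log(o) + (1.0 - t) * np.log(1.0 - o))

print(binary_cross_entropy(1.0, 0.9))  # small loss for a confident hit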

The whole process is implemented in the Keras framework, a minimalist and highly modular neural network library. Keras is written in Python and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation, and it supports both CPUs and GPUs.
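With the binary cross-entropy loss, training in Keras reduces to a compile-and-fit call, as in the sketch below. It reuses the model and max_length from the architecture sketch above; the optimizer, batch size, epoch count and the random stand-in data are all illustrative assumptions.

# Sketch: compiling and fitting the model with binary cross-entropy.
import numpy as np

X_train = np.random.randint(0, 21, size=(256, max_length))  # toy stand-in data
y_train = np.random.randint(0, 2, size=(256,))              # toy 0/1 labels

model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=128, epochs=20, validation_split=0.2)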

Results

Experiment setups

We used three kinds of datasets, balanced (equal), imbalanced (realistic) and multi-species, to benchmark the performance of different models. For each dataset, 80% of the samples are chosen randomly for training and the rest for testing. The final performance is given as the best of the k-fold (k = 3, 5, 10) cross validations.
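The k-fold protocol can be sketched with scikit-learn's KFold; the fold counts follow the paper, while the toy data and the dummy classifier below are stand-ins for the encoded sequences and the Keras model.

# Sketch of the k-fold protocol (k = 3, 5, 10): train on k-1 folds,
# test on the held-out fold, and report the best fold accuracy.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.dummy import DummyClassifier  # stand-in for the Keras model

X = np.random.randint(0, 21, size=(100, 50))  # toy encoded sequences
y = np.random.randint(0, 2, size=(100,))      # toy binary labels

for k in (3, 5, 10):
    scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True).split(X):
        clf = DummyClassifier(strategy='most_frequent')
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    print('%2d-fold best accuracy: %.3f' % (k, max(scores)))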

All experiments use the same network parameters. The input parameters and output sizes of each layer are shown in Table 6.

Evaluation measures

To evaluate the performance of the proposed method, several assessment measures are used in this study: accuracy, sensitivity and specificity, defined in Eqs 16 to 18:

Accuracy = (TP + TN) / (TP + TN + FP + FN)  (16)

Sensitivity = TP / (TP + FN)  (17)

Specificity = TN / (TN + FP)  (18)

where TP, TN, FN and FP are the numbers of true positives, true negatives, false negatives and false positives, respectively. The sensitivity indicates the accuracy of predicting positive samples, the specificity indicates the accuracy of predicting negative samples, and the accuracy is the ratio of correctly predicted samples in the test set.

Additionally, the area under the receiver operating characteristic curve (AUC) is applied to evaluate performance. AUC is a robust overall measure because its calculation relies on the complete ROC curve and thus involves all possible classification thresholds.
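All four measures can be computed from a vector of predicted scores, as in this sketch (the 0.5 decision threshold is an assumption):

# Sketch: accuracy (Eq 16), sensitivity (Eq 17), specificity (Eq 18) and AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, y_score, threshold=0.5):
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {'accuracy': (tp + tn) / (tp + tn + fp + fn),
            'sensitivity': tp / (tp + fn),
            'specificity': tn / (tn + fp),
            'auc': roc_auc_score(y_true, y_score)}

y_true = np.array([1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.7, 0.4, 0.2, 0.3, 0.6])
print(evaluate(y_true, y_score))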

The results in equal data set

To demonstrate the ability of the proposed method to predict DNA-binding proteins, we first evaluate it on the independent testing dataset by k-fold (k = 3, 5, 10) cross validation. In k-fold cross validation, the protein sequences are randomly divided into k equal parts; in each experiment, one part is kept as the testing set and the other k − 1 parts are used as the training set. The accuracies of the 3-, 5- and 10-fold experiments are 87.5%, 92.8% and 93.1%, respectively. We then use the best model from the 5-fold experiment (see Fig 4) to test sequences from the Arabidopsis and Yeast species [25], and compare the accuracies with those of the DNAbinder and LibSVM predictions (see Table 7). The results show that the prediction accuracies of our model outperform LibSVM by nearly 8% and 4% for the Arabidopsis and Yeast species, respectively.

Table 7. The prediction accuracies across different models.

https://doi.org/10.1371/journal.pone.0188129.t007

The results in realistic data set

For the realistic dataset, we report the accuracy, sensitivity, specificity and AUC values in Table 8, and draw the ROC curves for the testing and validation datasets in Figs 5 and 6, respectively.

From the results, we can see that our model works well on both class-imbalanced and balanced datasets, with competitive ROC behavior; class imbalance is a very hard situation for traditional machine learning methods.

Results in multi-species dataset

To further verify the generality of our method across species, we train three models for the human, mouse and rice species, then use each model to test the others. Table 9 shows the results across species.

From the results, the human model works well on both itself and mouse, and vice versa. The rice model works well on itself but gives lower accuracy on human and mouse. These results coincide with the fact that humans are genetically close to mice but distant from rice.

Performance comparisons with different feature extraction methods

We also compare the performance of our deep learning model with that of other feature extraction methods on both the equal and realistic datasets.

Three kinds of feature extraction methods, 188D [28], Auto Covariance (AC) [29] and Conjoint Triad (CT) scores [30], are used, and linear regression, support vector machines and random forests are applied to test the performance of those features.

The 188D method extracts sequence characteristics according to the composition, distribution and physicochemical properties of amino acids, forming a 188-dimensional vector to represent the raw sequence. The AC method takes into account the interactions between amino acids within the entire sequence, representing each sequence by a vector of AC variables describing the average interactions between residues. The CT method divides all amino acids into seven categories, regards three consecutive amino acids as a unit, and represents each sequence by a 343-dimensional vector of the frequencies of the triad types appearing in the amino acid sequence.

The performance comparisons using the same training and testing datasets are summarized in Tables 10 and 11, respectively. For the equal dataset, our model outperforms the best previously known result, obtained by the combination of 188D and SVM, by 2%, and the average of all the others by more than 10%. For the realistic dataset, our model achieves accuracy similar to the best of the others (188D+RF), but exceeds its sensitivity and AUC by 0.2627 and 0.1511, respectively. This suggests that the proposed model is more reliable and robust.

Table 11. Performance comparisons on the realistic dataset.

https://doi.org/10.1371/journal.pone.0188129.t011

Performance comparisons with low-redundancy training sets

We train the models on the low-redundancy versions of the equal and realistic datasets and compare their performance with the 188D+SVM method. The results are shown in Tables 12 and 13.

Table 13. The results in low-redundancy realistic data set.

https://doi.org/10.1371/journal.pone.0188129.t013

For the low-redundancy version of the equal dataset, the accuracy is 3.86% lower than with the full version and slightly higher than with the 188D+SVM method. When the model is applied to the Arabidopsis and Yeast datasets, the accuracies are 85% and 78%, respectively, slightly lower than those of the full model.

For the low-redundancy version of the realistic dataset, the model performs worse than the full model but better than the 188D+SVM method on all measures.

These results suggest that sequence redundancy in the training dataset does not decrease performance, and even somewhat increases predictive power, because deep learning requires a huge volume of data to fit its model and provides mechanisms (especially dropout) to overcome model over-fitting.

Discussion

In the computer vision area, recent research [31] reveals that network depth is of crucial importance; for example, on the challenging ImageNet task, models with depths from sixteen [31] to thirty [32] layers have been exploited. To compare the effects of different network depths and filter lengths, we designed two other models. The first contrastive model is a single-layer CNN with filter length 5. The second has two CNN layers, with filter length 5 in the first layer.

Fig 7 shows that the model with two CNN layers speeds up the convergence of the loss function, while the model with the larger filter length achieves better convergence of the loss function. The prediction accuracy demonstrates similar behavior, see Fig 8.

During the experiments, we found that the performance of the neural networks is driven largely by the data rather than by artificially designed structures. On small datasets, deep learning does not perform better than traditional machine learning methods; with the rapid growth of protein sequence data, the advantage of deep learning will become increasingly apparent. Meanwhile, computing speed is an unavoidable concern, and GPUs are often used to accelerate the computation in this situation.

A recent work predicted DNA-binding proteins interacting with ssDNA (single-stranded DNA) or dsDNA (double-stranded DNA) using overall amino acid composition (OAAC) features, dipeptide composition, position-specific scoring matrix (PSSM) profiles and split amino acid composition (SAA) [33]. Tested with SVM (support vector machine) and RF (random forest) classification models, their method achieves an accuracy of 88.7% and an AUC of 0.919. Our method achieves an accuracy of 94.2% and an AUC of 0.961 on the realistic dataset. Moreover, the deep learning approach sidesteps the tedious feature selection procedure and lets scientists put more effort into biological analysis.

All the source code used in this study is available at the figshare server (https://doi.org/10.6084/m9.figshare.5231602.v1). A user-friendly web server for predicting DNA-binding proteins is accessible at http://119.23.251.26/WebServer/.

Conclusion

Computational biologists often struggle to extract meaningful features and to choose an appropriate machine learning algorithm when predicting the spatial structures or functions of biological sequences. The deep learning framework, capable of learning features automatically and training models via back propagation, is making great strides in these fields. In this paper, we presented a deep learning based approach for predicting the DNA binding functions of proteins using primary sequences only. The two layers of CNN plus the LSTM network increase the learning power and offer more potential for motif refinement, in terms of both local connectivity and long-term dependence.

Compared with DNAbinder and LibSVM, the proposed method shows state-of-the-art performance on both the equal and realistic datasets. It also demonstrates substantial generality in multi-species testing. Moreover, the method outperforms most existing feature extraction methods paired with a successful machine learning algorithm in terms of accuracy, specificity, sensitivity and AUC. This comprehensive investigation of the deep learning model for predicting the DNA binding functions of proteins may yield a competitive tool for future proteomics studies. The proposed deep learning approach also has many other potential applications, such as protein remote homology detection [34] and miRNA prediction [35].

Acknowledgments

This research is partly supported by the National Natural Science Foundation of China under grant number 61170177, the National Basic Research Program of China under grant number 2013CB32930X, and the National High Technology Research and Development Program of China under grant number 2013CB32930X.

References

  1. Kumar M, Gromiha MM, Raghava GPS. Identification of DNA-binding proteins using support vector machines and evolutionary profiles. BMC Bioinformatics. 2007;8:463. pmid:18042272
  2. Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Research. 2003;31(13):3692–3697. pmid:12824396
  3. Lin WZ, Fang JA, Xiao X, Chou KC. iDNA-Prot: identification of DNA binding proteins using random forest with grey model. PLoS ONE. 2011;6(9):e24756. pmid:21935457
  4. Zou C, Gong J, Li H. An improved sequence based prediction protocol for DNA-binding proteins using SVM and comprehensive feature analysis. BMC Bioinformatics. 2013;14(1):90. pmid:23497329
  5. Zhou C, Yu H, Ding Y, Guo F, Gong XJ. Multi-scale encoding of amino acid sequences for predicting protein interactions using gradient boosting decision tree. PLoS ONE. 2017.
  6. Lou W, Wang X, Chen F, Chen Y, Jiang B, Zhang H. Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naïve Bayes. PLoS ONE. 2014;9(1):1–10.
  7. Ma X, Guo J, Sun X. DNABP: identification of DNA-binding proteins based on feature selection using a random forest and predicting binding residues. PLoS ONE. 2016;11(12):1–20.
  8. Liu B, Xu J, Lan X, Xu R, Zhou J, Wang X, et al. iDNA-Prot|dis: identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition. PLoS ONE. 2014;9(9):e106691. pmid:25184541
  9. Liu B, Xu J, Fan S, Xu R, Zhou J, Wang X. PseDNA-Pro: DNA-binding protein identification by combining Chou's PseAAC and physicochemical distance transformation. Molecular Informatics. 2015;34(1):8–17. pmid:27490858
  10. Liu B, Wang S, Wang X. DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation. Scientific Reports. 2015;5:15479. pmid:26482832
  11. Liu B, Wang S, Dong Q, Li S, Liu X. Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning. IEEE Transactions on Nanobioscience. 2016;15(4):328–334.
  12. Liu B, Liu F, Wang X, Chen J, Fang L, Chou KC. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Research. 2015;43(W1):W65–W71. pmid:25958395
  13. Liu B, Wu H, Zhang D, Wang X, Chou KC. Pse-Analysis: a python package for DNA/RNA and protein/peptide sequence analysis based on pseudo components and kernel methods. Oncotarget. 2017;8(8):13338. pmid:28076851
  14. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Proceedings of NIPS; 2012.
  15. Graves A, Mohamed AR, Hinton G. Speech recognition with deep recurrent neural networks. In: Proceedings of ICASSP; 2013.
  16. Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with neural networks. In: Proceedings of NIPS; 2014.
  17. Angermueller C, Pärnamaa T, Parts L, Stegle O. Deep learning for computational biology. Molecular Systems Biology. 2016;12:878. pmid:27474269
  18. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology. 2015;33(8):831–838. pmid:26213851
  19. Zeng H, Edwards MD, Liu G, Gifford DK. Convolutional neural network architectures for predicting DNA-protein binding. Bioinformatics. 2016;32(12):i121–i127. pmid:27307608
  20. Pan X, Shen HB. RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach. bioRxiv. 2016.
  21. Wang S, Sun S, Li Z, Zhang R, Xu J. Accurate de novo prediction of protein contact map by ultra-deep learning model. arXiv preprint arXiv:1609.00680. 2016.
  22. Jo T, Hou J, Eickholt J, Cheng J. Improving protein fold recognition by deep learning networks. Scientific Reports. 2015;5:17573. pmid:26634993
  23. Wang S, Peng J, Ma J, Xu J. Protein secondary structure prediction using deep convolutional neural fields. Scientific Reports. 2016;6:18962. pmid:26752681
  24. Asgari E, Mofrad MRK. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE. 2015;10(11):1–15.
  25. Motion GB, Howden AJM, Huitema E, et al. DNA-binding protein prediction using plant specific support vector machines: validation and application of a new genome annotation tool. Nucleic Acids Research. 2015;43(22):e158. pmid:26304539
  26. Graves A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. 2013.
  27. Shore JE, Johnson RW. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory. 1980;26(1):26–37.
  28. Lin C, Zou Y, Qin J, et al. Hierarchical classification of protein folds using a novel ensemble classifier. PLoS ONE. 2012;8(2).
  29. Guo Y, Yu L, Wen Z, Li M. Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Research. 2008;36(9):3025–3030. pmid:18390576
  30. Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, et al. Predicting protein-protein interactions based only on sequences information. Proceedings of the National Academy of Sciences. 2007;104(11):4337–4341.
  31. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014.
  32. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of ICML; 2015.
  33. Wang W, Sun L, Zhang S, Zhang H, Shi J, Xu T, et al. Analysis and prediction of single-stranded and double-stranded DNA binding proteins based on protein sequences. BMC Bioinformatics. 2017;18(1):300. pmid:28606086
  34. Liu B, Zhang D, Xu R, Xu J, Wang X, Chen Q, et al. Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics. 2013;30(4):472–479. pmid:24318998
  35. Liu B, Liu F, Fang L, Wang X, Chou KC. repRNA: a web server for generating various feature vectors of RNA sequences. Molecular Genetics and Genomics. 2016;291(1):473–481. pmid:26085220