Splice site prediction using support vector machines with a Bayes kernel

https://doi.org/10.1016/j.eswa.2005.09.052

Abstract

One of the most important tasks in correctly annotating genes in higher organisms is to accurately locate the DNA splice sites. Although existing methods achieve relatively high accuracy, most of them are computationally intensive. Given the enormous amount of DNA sequence data to be processed, computational speed is an important consideration. In this paper, we present a new machine learning method for predicting DNA splice sites, which first applies a Bayes feature mapping (kernel) to project the data into a new feature space and then uses a linear Support Vector Machine (SVM) as a classifier to recognize the true splice sites. The computation time is linear in the number of sequences tested, while classification accuracy, precision, and recall are notably improved compared with the Naive Bayes classifier. Our classification results are also comparable to those obtained by SVMs with polynomial kernels, while our proposed method is significantly faster. This is a notable improvement in computational modeling considering the huge amount of DNA sequences to be processed.

Introduction

Advances in sequencing technologies have produced a large amount of DNA sequence information and therefore a dramatic increase in the size of genetic and genomic databases. The genome sequence information is produced as sequences of base pairs. However, no real knowledge of how the genome works is revealed unless different regions of the genome and their functions are characterized. Therefore, an important goal in bioinformatics is to accurately annotate the genome sequence information within an acceptable timeframe. Many computational approaches have recently been explored for predicting gene structures (Burge & Karlin, 1997) from DNA sequences and aiding the extensive analysis of genome sequences, including recognizing translation initiation sites of genes (Zien, Ratsch, Mika, Scholkopf, Lengauer, & Muller, 2000), discovering transcription factor binding sites in promoter sequences (Lim, Sim, Chung, & Park, 2003), and identifying DNA splice sites (Jones & Watkins, 2000; Mache & Levi, 2000; Weber, 2001).

Gene expression in eukaryotes starts with the transcription of DNA sequences into mRNA sequences, followed by the processing of pre-mRNAs to mature mRNAs, and then the translation of mRNAs to proteins. Splicing is one of the primary post-processing steps of pre-mRNAs in eukaryotes. During splicing, the introns, the non-coding regions of genes, are removed from the primary transcripts, and the exons, the coding regions, are joined to form a continuous sequence that specifies a functional polypeptide (see Fig. 1 for an illustration). The 5′ side of the intron is a donor splice site and the 3′ side is an acceptor splice site. As most eukaryotic genes contain introns, many of which interrupt an exon within a codon, an important part of gene prediction in eukaryotes is therefore to predict splice sites.

This paper focuses on the problem of identifying DNA splice sites. Locating splice sites is an interesting problem because of the special structure of the sequences around them. The dinucleotides GT and AG are often indicative of donor and acceptor splice sites, respectively. However, this canonical GT-AG rule does not always hold: many GT and AG occurrences are not real splice sites. Thus, it is natural to model the prediction of splice sites as a binary classification problem, using DNA sequences with experimentally confirmed splice sites as positive training examples and DNA sequences that have the GT-AG structure but are confirmed not to be real splice sites as negative training examples.
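To make the candidate-generation step concrete, the following minimal Python sketch scans a sequence for GT and AG dinucleotides and cuts a fixed window around each hit. The function names, the toy sequence, and the window sizes are hypothetical choices for illustration only; they are not taken from the paper.

```python
def candidate_sites(seq, dinucleotide):
    """Return 0-based start positions of every occurrence of `dinucleotide`."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide]


def window(seq, pos, upstream, downstream):
    """Cut the candidate dinucleotide at `pos` plus flanking bases.

    Returns None when the window would run past either end of the sequence.
    The flank lengths are assumptions, not values from the paper.
    """
    start, end = pos - upstream, pos + 2 + downstream
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]


# Toy sequence (hypothetical, for illustration only).
seq = "CAGGTAAGTCTAGGTT"
donors = candidate_sites(seq, "GT")      # candidate donor sites
acceptors = candidate_sites(seq, "AG")   # candidate acceptor sites
first_donor_window = window(seq, donors[0], upstream=3, downstream=4)
```

Each window would then be labeled as a true or false splice site from experimental annotation, which yields the positive and negative examples of the binary classification problem.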

Artificial neural networks (Acır & Güzeli, 2004), Bayesian classifiers (Stockwell, 1993), and SVMs (Min & Lee, 2005; Shin et al., 2005) are important expert systems that have been applied to solve real-world problems. Several of these expert systems have been applied to interesting bioinformatics problems. For example, Wang, Kuo, Chen, Hsiao, and Tsai (2005) built a knowledge sharing system for protein families (KSPF) using sequence pattern data mining and knowledge management. In this paper, we focus on the problem of recognizing true splice sites. Table 1 summarizes selected models used in predicting splice sites and their references. Although relatively high accuracy has been achieved with the methods currently available, almost all of the existing methods are computationally very demanding. Consequently, splice site prediction continues to be a major bottleneck in gene annotation.

In this study, we employ a linear SVM, which is computationally less intensive than SVMs with polynomial kernels, to recognize true splice sites. However, the DNA sequence information is given as strings, while the SVM classifier can only take numerical inputs. Thus, the very first step is to encode or map the DNA sequences into numbers. A widely used encoding method is sparse encoding, where each letter in the DNA sequence is represented by four bits. With this encoding, however, the sequence data are in general not linearly separable, so a linear SVM performs poorly. Instead, a novel mapping/encoding method derived from Bayes' rule is used to project the data into a new feature space where the true splice sites and the false splice sites can then be classified by linear SVMs. An advantage of the Bayes encoding method is that its probabilistic encoding framework takes into consideration the natural mutations in the DNA sequences. Experimental results show that the performance of our proposed method is comparable to that of SVMs with polynomial kernels in terms of accuracy, precision, and recall, while our method is significantly faster. The computation time is linear in the number of sequences tested, and accuracy, precision, and recall are notably improved compared with the Naive Bayes classifier. Considering the overwhelming amount of DNA sequences that need to be processed, the increased speed of our method is a very desirable property.
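The two encodings can be contrasted with a short Python sketch. Sparse encoding is shown as described above (four bits per nucleotide). The exact form of the Bayes feature mapping is not given in this excerpt, so `bayes_encode` below is only an assumed illustration: it replaces each nucleotide with its smoothed position-specific frequency among true sites and among false sites. The function names, the pseudocount, and the toy sequences are all hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"


def sparse_encode(seq):
    """Sparse (one-hot) encoding: each nucleotide becomes four bits."""
    bits = []
    for base in seq:
        bits.extend(1 if base == letter else 0 for letter in ALPHABET)
    return bits


def position_frequencies(seqs, pseudocount=1.0):
    """Per-position nucleotide frequencies with additive smoothing.

    The pseudocount is an assumption that avoids zero probabilities.
    """
    total = len(seqs) + pseudocount * len(ALPHABET)
    table = []
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs)
        table.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return table


def bayes_encode(seq, true_freq, false_freq):
    """Assumed mapping: two real-valued features per position, namely
    P(base | position, true site) and P(base | position, false site)."""
    features = []
    for pos, base in enumerate(seq):
        features.append(true_freq[pos][base])
        features.append(false_freq[pos][base])
    return features


# Toy aligned windows (hypothetical, not data from the paper).
true_sites = ["AGGT", "AGGT", "CGGT"]
false_sites = ["TTGT", "ACGT", "GAGT"]
true_freq = position_frequencies(true_sites)
false_freq = position_frequencies(false_sites)

sparse_vec = sparse_encode("AC")
bayes_vec = bayes_encode("AGGT", true_freq, false_freq)
```

Under this assumed mapping, a sequence of length n yields 2n real-valued features instead of 4n binary ones, and the class-conditional statistics are already built into the inputs handed to the linear SVM.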

The rest of this paper is organized as follows. In Section 2, we give an introduction to SVMs. In Section 3, the Naive Bayes classifier is explained, and the Bayes feature mapping method is explored. In Section 4, we describe our experiment with splice site prediction and our theoretical analysis of the proposed method. In Section 5, the experiment results are presented. Finally, we give the conclusion in Section 6.

Section snippets

Support vector machines

Support Vector Machines (Vapnik, 1998) are powerful pattern recognition techniques that have been successfully applied to many machine learning tasks such as classification (Scholkopf, Burges, & Smola, 1999) and regression (Smola & Scholkopf, 2004). They have outperformed many other machine learning methods such as artificial neural networks and k-nearest neighbors and attracted a great deal of attention from the machine learning community because of many needed properties, including good

The proposed algorithm: SVMs with Bayes kernel

Fig. 3 depicts a generic framework of the classification process used in this study. The proposed algorithm is a hybrid of SVMs and a Bayes feature mapping (denoted as SVM-B). SVMs are a binary classification method that discriminates one set of data points from another. They only take numerical data as input. However, the DNA sequences are given as strings of nucleotides {A, T, C, G}. When using computational tools to analyze and classify the sequence data, an important step is encoding the

Experimental design

We test the relative performance of the combined Bayes mapping and SVM method (denoted as SVM-B) in recognizing true splice sites with a series of 10-fold cross-validation experiments. We compare the performance of the proposed SVM-B method with the Naive Bayes classifier, SVMs with a linear kernel (SVM-L), and SVMs with polynomial kernels (SVM-P) of degree d equal to 2 or 3. As a benchmark, we use the traditional sparse encoding method (Jones & Watkins, 2000) for the SVM-L and SVM-P methods.
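A 10-fold cross-validation split of the kind described here can be sketched in plain Python as follows. The function name, the fixed seed, and the sample count are hypothetical; the paper's actual partitioning procedure is not specified in this excerpt.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Indices are shuffled once, dealt into k folds of near-equal size,
    and each fold serves as the test set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical usage with 25 labeled windows.
splits = list(k_fold_indices(25, k=10))
```

Each classifier under comparison would be trained on `train` and scored on `test` for every split, and the per-fold scores would then be averaged.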

We report

Experimental results

The results of the experiments are shown in Tables 6 to 9. Tables 6 and 7 summarize the computational results and the paired t-test for a small data set (Dsmall), where the accuracy, precision, recall, and CPU times were averaged over the ten-fold cross-validation experiments. Their standard deviations were also computed. Based on the average precision and recall, we then computed the overall F-measure.
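The reported measures can be illustrated with a brief Python sketch. It assumes the balanced F-measure, F = 2PR / (P + R), which is the common definition; the excerpt does not state which variant was used. The confusion-matrix counts are made-up numbers for illustration, not results from the paper.

```python
from statistics import mean


def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-fold counts (tp, fp, tn, fn); two folds shown for brevity.
fold_counts = [(90, 10, 85, 15), (88, 12, 90, 10)]
per_fold = [classification_metrics(*counts) for counts in fold_counts]

avg_precision = mean(p for _, p, _ in per_fold)
avg_recall = mean(r for _, _, r in per_fold)
overall_f = f_measure(avg_precision, avg_recall)
```

As in the text, the F-measure here is computed once from the fold-averaged precision and recall rather than averaged fold by fold.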

As can be seen from the computational results, in terms of accuracy and F

Conclusion

Predicting splice sites is an important part of gene structure prediction. In recent years, several machine learning methods, such as artificial neural networks, perceptrons, and support vector machines, have been applied to the problem and have recognized true splice sites with sufficiently high accuracy. However, almost all of the existing methods are computationally intensive; therefore, splice site prediction remains a major bottleneck in gene annotation.

In this paper, we
