Keywords

1 Introduction

Recent advancements in sequencing technologies have seen an exponential growth in protein sequences. One of the modern studies is to identify the function of enzymes. In labs, it is a time-consuming experiment to decipher the function of the sequences. Moreover, it is necessary to predict whether the protein is an enzyme or non-enzyme. The classification process of enzymes is organized based on its biochemical reactions by the Enzyme Commission of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) [1].

According to main Enzyme Commission EC (is a numerical classification scheme for enzymes, based on the chemical reactions) enzymes are classified into six main classes such as (I) oxidoreductase, catalysing oxidation-reduction reactions; (II) transferase, transferring a chemical group from one substrate (the donor) to another (the acceptor); (III) hydrolase, These enzymes catalyse the hydrolytic cleavage of bonds; (IV) lyase, catalysing the nonhydrolytic and cleave C–C, C–O, C–N and other bonds by means other than hydrolysis or oxidation; (V) isomerase, catalysing geometrical or structural changes within one molecule; (VI) ligase, catalysing the joining together of two molecules coupled with hydrolysis of a pyrophosphate bond in ATP or a similar triphosphate. The recommended name often takes the form A-B ligase [2].

For prediction of enzyme function, there are many approaches are widely used. The first method was to predict enzyme functions based on sequence similarity [3]. Faria et al. [3], developed a machine learning methodology called Peptide Program (PPs) that achieved a high accuracy when used small dataset. Des Jardins et al. [4], proposed a novel approach to predict the function of a protein from its amino acid sequences, they achieved performance better than famous algorithms, e.g. Basic Local Alignment Sequence Tool (BLAST) and Support Vector Machine (SVM) in case using a small dataset. Many other studies were based on extracting features from sequences. Mohammed and Guda proposed a model [5], reviewed a group of studies, which used a number of computational to predict and classify enzymes based on the extracted features, e.g. molecular weight, Polar, etc. Moreover, many studies predicted the function of enzymes based on combining two different methods. Omer et al. present a new method for extracting features from motif content and protein composition for protein sequence classification [6]. Bum et al. proposed new PNPRD features representing global and/or local differences in sequences, based on positively and/or negatively charged residues, to assist for predicting protein function, and useful feature subset for predicting the function of various proteins [7]. Moreover, Chien et al. reported that, some specific features represent the physiochemical properties of protein complex subunits [8].

In this paper, we proposed an approach to predict enzyme function class based on different sequence alignment techniques. We compare between the local and global sequence alignment technique to select the best technique in predicting the enzyme function class. Moreover, to improve the classification accuracy, different score matrices such as BLOSUM and PAM are used. The computer program which was used for the implementation of this approach has been written and developed by the authors (using MATLAB 7.11 (R2010b)) and without the need for any other ready software or online tools, which adds more facility to use different assessment methods and develop our approach to increase the accuracy.

The rest of this paper is organized as follows. Section 2 describes the theoretical background of the sequence alignment technique. Moreover, the local and global sequence alignment techniques are explained; Sect. 3 introduces the proposed approach; Sect. 4 presents experimental scenarios, results, and discussion; finally, Sect. 5 summarizes the conclusions and future work.

2 Preliminaries

2.1 Enzyme Prediction and Classification

There are three famous approaches widely used for enzyme prediction and classification based on protein’s structure, protein’s features, and sequence alignment. Many studies were classified the enzymes based on structure (e.g. the structure information of protein more conserved from sequence information, protein functions governed by its structure, and sometimes sequence similarity between two proteins exists). But, this method not strong enough to produce an unambiguous alignment). Whoever, recent approaches used the simple sequence, structure properties and many features that extracted from the enzymes (i.e. predicted secondary structures, physiochemical properties, sequence motifs, and highly conserved regions) [3]. The third approach is based on sequence similarity measure. This approach is the most common way of functional prediction, and the majority of sequences in the protein databases is annotated using just sequence comparison [9].

2.2 Sequence Alignment Technique

Sequence alignment is a process of arranging two (Pairwise Alignment) or more (Multiple Alignment) sequences of characters to define the region of similarity and calculate the similarity score. There are many problems and challenges that face the process of calculating similarity scores such as, (i) the sequences usually are in different length and different gaps, (ii) There may be, the matching (similar) region between two sequences is a relatively small region with respect to the length of each sequence [10, 11].

Consider X and Y is a pair of two different protein or DNA sequences, where \(X \equiv x_1 x_2 \dots x_m, Y \equiv y_1 y_2 \dots y_n\), where \(x_i\) and \(y_i\) are letters chosen from the alphabet. To calculating the distance or similarity between X and Y, the two sequences are aligned first using sequence alignment technique such as local or global sequence alignment techniques.

Fig. 1
figure 1

Pairwise sequence alignment process, indicates match, mismatch, insertion, and deletion gap cases

In each sequence alignment process, gaps are an arbitrary number of null characters or spaces (represented by dashes) may be placed and it is called indel. Each indel represents an insertion of a character into one sequence or deletion of a character from the other one as shown in Fig. 1.

There are many techniques to calculate the score of alignment gaps with letters (Gap score). The constant gap penalty is considered one of the simplest method. In this technique, a gap with any size, receives the constant negative penalty \(-g\). In a Linear gap penalty, the penalty depends on the length of the gap. So, \(-g\) in this technique represents a penalty per unit length of the gap. Gap scores are independent of letters. However, gap penalty value is subtracted from the total alignment score to calculate the final score [10].

After aligning the pair of sequences, we need to measure the score of alignment or similarity. Calculating the scores or similarity of alignment depends on the values of matching scores or mismatch penalty scores. In other words, it depends on the scoring or substitution matrix that have the values of matching scores and mismatching penalty scores [11].

Scoring matrix or substitution matrix is a set of values representing the likelihood of one residue being substituted by another. Since, the proteins composed of 20 amino acid, the scoring matrices for proteins are \(20\times 20\) matrix. Two well-known scoring matrices for proteins are PAM and BLOSUM [10].

In PAM (Point Accepted Mutation or Percent Accepted Mutation), a substitution of one amino acid by another that has been fixed by natural selection. In PAM1 (i.e. original PAM score matrix) \(1\,\%\), of the amino acids in a sequence are expected to accept mutation, so 1 % sequence divergence. PAM score matrix applies when the divergence is low. The higher suffix number, the better it is in dealing with distant sequence alignment [12]. Increasing the suffix number of PAM scoring matrix will increase the score of alignment between two different sequences, which reflects that, when the suffix number increased the aligned distant sequences can be identified better. The (ith,jth) cell in a PAM matrix denotes the probability that amino-acid i will be replaced by amino-acid j [12].

In BLOSUM (BLOck SUbstitutions Matrices), the blocks are ungapped multiple sequence alignment corresponding to the most conserved regions of the sequences involved. Default BLOSUM matrix is BLOSUM62 (used in BLAST algorithms), which represents the sequences used to create the BLOSUM62 matrix have approximately 62 % identity. Hanikoff and Hanikoff tested the performance of BLOSUM and PAM and found that BLOSUM matrices performed better than PAM matrices. In contrast, with PAM matrices, the higher the suffix number in BLOSUM matrix, the better it is in dealing with closer sequences [12].

The alignment score is the sum of the scores for aligning pairs of letters (alignment of two letters) and gap scores (alignment of a gap with a letter). For example, if the matching score is five, mismatch penalty is \(-3\), and gap or space penalty is \(-2\), so the alignment score of the two sequences in Fig. 1 equal to \(=5-2-2-2+5-3+5-2-2+5=7\). In the next section, global and local sequence alignment techniques which considered the most popular sequence alignment techniques are discussed.

Global Sequence Alignment: In Global Alignment technique, all letters and null in each sequence must be aligned from one end to the other as shown in Eq. (1), where s(ab) represents score for aligning letters a and b, g represents gap score, \(x_i\) is the partial sequence consisting of i th letter of \(X \equiv x_1 x_2 \dots x_m\), \(y_i\) is the partial sequence consisting of j th letter of \(Y \equiv y_1 y_2 \dots y_n\), and SIM(ij) represents the similarity of \(x_i\) and \(y_j\). In the first line of Eq. (1), \(x_i\) and \(y_i\) are aligned. But, in the second line x is aligned with a null, while in the third line y is aligned with a null [13].

$$\begin{aligned} SIM(i,j)= max \left\{ \begin{array}{l} SIM(i-1,j-1)+s(x_i+y_i)) \\ SIM(i-1,j)+g \\ SIM(i,j-1)+g \end{array}\right. \end{aligned}$$
(1)

Example 1

Given two different sequences X and Y, where \(X={TGCCGTG}\) and \(Y={CTGTCGCTGCACG}\). The matching score is equal to four, mismatch penalty equals \(-2\), and gap penalty is also \(-2\).

For the pair of sequences with length m and length n, the size scoring matrix is needed for their alignment must be \((m+1) \times (n+1)\) as shown in Table 1.

The first step in any pairwise sequence alignment technique is to fill the entire matrix as shown in Table 1. In the global alignment algorithm, the scoring matrix step starts by assigning the two sequences’ letters and gap penalties on the left and top headers of the table as shown in Table 1. To fill all cells of the matrix, first assigns the gap penalty in the top row and the left column in a cumulative way. Then, assigns the maximum of all three cells to the new cell as in Eq. (1).

After finishing from assigning values to all cells as shown in Table 1, the next step in pairwise sequence alignment is to trace back according to the values that are calculated before in the scoring matrix. As shown in Table 1 we note that, the bold fonts show the trace-back of Example 1. The final step is to make alignment by tracing back methods as shown in Fig. 1. As shown in the figure we note that, going straight right or down means, insertion or deletion gap, respectively. While, moving in diagonal represents match or mismatch cases. The steps and directions we follow it to move from one cell to the other (forward) are the same directions we trace it back to align the two sequences. Finally, the final alignment is shown in Fig. 2.

Table 1 Score matrix of the global sequence alignment techniques using the two sequences in Example 1
Fig. 2
figure 2

Example of local and global sequence alignment techniques for pair of sequences in Example 1 and Example 2

Local Sequence Alignment: Smith-Waterman proposed the first algorithm that used local alignment to measure the similarity between different sequences as in Eq. (2). Global alignment works better if the sequences are similar in length and characters while the local alignment is useful when the sequences are not similar in length or characters. However, local alignment finds the most similar regions in any two sequences being aligned as shown in Fig. 2 [12]. In Fig. 2 we note that, the global alignment technique aligns the whole length of the two sequences, while the local alignment technique aligns the most similar region in the two sequences. So, the local alignment technique isolates regions in the sequences, hence it is easy to detect repeats while the global alignment is suitable for overall similarity detection.

$$\begin{aligned} SIM(i,j)=max \left\{ \begin{array}{l} SIM(i-1,j-1)+s(x_i+y_i) \\ SIM(i-1,j)+g \\ SIM(i,j-1)+g \\ 0 \end{array}\right. \end{aligned}$$
(2)

Example 2

Given two different sequences X and Y with different lengths m and n respectively, where \(X=[TGCCGTG]\) and \(Y=[CTGTCGCTGCACG]\) as in Example 1. The matching score is four, mismatch penalty equals \(-2\), and gap penalty is also equal to \(-2\). The size of the scoring matrix is needed for their alignment must be \((m+1) \times (n+1)\).

The first step in any local sequence alignment algorithm is to initialize first row and first column to be 0 as shown in Table 2. To calculate the value of that cell as in Eq. (2), we assign the maximum of all three cells or zero to the new cell. After filling all cells, trace back as in the global alignment technique. Finally, the final alignment is as shown in Fig. 2.

Table 2 Score matrix of the local sequence alignment techniques using the two sequences in Example 2

3 The Proposed Enzyme Function Classification Approach

The first step in the proposed approach is to collect labelled enzyme sequences with different functions or categories as shown in Fig. 3. In the proposed approach, we have two techniques to compute the pairwise alignment between any two sequences, namely, global and local alignment techniques. Thus, the similarity between the unknown sequence and each sequence in the labelled (i.e. training) sequences are computed using the global and local sequence alignment techniques. The similarity or score reflects the distance between the unknown sequence and all the labelled sequences. The class function of the sequence that has the maximum score (i.e. maximum similarity) is assigned to the unknown sequence. The details of the proposed approach are summarized in Algorithm 1.

Fig. 3
figure 3

Block diagram of the proposed approach

figure a

4 Experimental Results and Discussion

There are two scenario’s which designed to test the proposed approach. In all classes we used one against all approach or, in other words, for each unknown sequence we calculate the sequence alignment between an unknown sequence and all other sequences in the dataset and select the maximum similarity. To save time and memory storage, we used a subset of the whole dataset which represents approximately a half (\(\frac{3470}{6923}=50.12\,\%\)) of the total number of all sequences as shown in Table 4.

4.1 Dataset

The dataset, which was used in all experiments were obtained from SWISS-PROT enzyme database. It consists of sequences of enzymes from [14], that have 6923 instances and the details of each class are shown in Table 3. The distribution of sequences across different classes was not equivalent as shown in Table 3.

Table 3 Distribution of all sequences across different classes in SWISS-PROT enzyme database
Table 4 Accuracy and CPU time of the local and global sequence alignment techniques using different score matrices (i.e. the first and second experimental scenarios)

4.2 Experimental Scenarios

The first experiment scenario is conducted to compare between the global and local alignment techniques using the default BLOSUM score matrix (BLOSUM62). The summary of this scenario is shown in Fig. 4)and Table 4.

The second experiment scenario is conducted to understand the effect of changing score matrix of the global alignment and local sequence alignment techniques. The summary of this scenario is shown in Table 4.

Fig. 4
figure 4

Accuracy of all enzyme function classes using local versus global sequence alignment techniques

4.3 Discussion

Figure 4 and Table 4 show the accuracy (i.e. the percentage of the total number of predictions that were correct) of local alignment technique achieved a good relatively accuracy ranged from 91.2 to 93.3 %, while the global sequence alignment technique achieved accuracy ranged from 85.4 to 89.9 %, which reflects that, the local alignment technique is better to align two sequences and calculate an accurate similarity score than global sequence alignment technique. On the other hand, as shown in Table 4 the global alignment technique needs CPU time less than the local alignment technique. Moreover, it is apparent from Fig. 4 that the first three classes achieved accuracy better than the other three classes because the number of sequences of the first three classes represents about 85.9 % of the total number of sequences (2980/3470). Furthermore, the third class (Hydrolases) achieved the best accuracy among all classes while the fifth class (Isomerases) achieved the worst one.

Table 4 illustrates the local sequence alignment technique achieved accuracy better than global sequence alignment technique using different score matrices. The best accuracy of local alignment achieved using BLOSUM62 score matrix while the best accuracy of global alignment processes achieved when using BLOSUM30 score matrix.

Table 4 and Fig. 4 show the accuracy of the proposed approach is better than Jardins et al. who have developed a model achieved 74 % in first level and 68 % in second level of Enzyme Commission (EC) [4], and Kumar et al. [15] achieved accuracy near to 80.37 %. Finally, our proposed approach is fully self-implemented without any needs to online tools.

5 Conclusions and Future Work

In this paper, we have proposed, developed, and implemented an approach to identify the class function of an unknown sequence using sequence alignment technique. In the proposed approach, two different sequence alignment techniques are used, namely, local and global sequence alignment techniques. Through the application of global alignment and local alignment to compute the pairwise alignment for any sequence and assign it to the suitable functional class when it gets the maximum score of alignment. The approach which we developed using MATLAB achieved a reasonable classification accuracy reached 89.9, 93.3 % when we apply global alignment technique using BLOSUM30, and local alignment technique using BLOSUM62, respectively. The results which achieved by the proposed approach are very promising and opens prospects for future development. In the future work, we will use different information fusion methods to combine the scores of the local and global techniques to increase the accuracy of our approach.