Coding of amino acids by texture descriptors
Introduction
Several applications require extracting features from peptides/proteins to solve a given classification problem [1]; examples include sub-cellular localization [2], protein–protein interactions [3], and HIV-1 protease cleavage site prediction [4], [5].
Probably the most widely used feature extractor for peptides and proteins is Chou's pseudo amino acid (PseAA) composition [6]. Several variants of this descriptor have been proposed in the literature: hydropathy scales [7], [8], physicochemical distance [9], digital code [10], complexity factor [11], [12], digital signal [13], Fourier low-frequency spectrum [14], cellular automata [15], and “artificial” features created by genetic programming that combines one or more “original” Chou's pseudo amino acid features [16]. The interested reader can refer to [17] and [18] for a survey of the different methods for extracting features from peptides and proteins.
Most of the feature extractors proposed in the literature are based on a vectorial representation of the peptide/protein. For example, in [19] a physicochemical encoding is proposed: each amino acid is represented by a 20-dimensional vector with all values set to zero except for the one corresponding to the considered amino acid, which takes the value of the measured physicochemical property. The descriptor associated with a peptide/protein is obtained by concatenating the 20-dimensional vectors corresponding to its amino acid sequence.
Other interesting encoding methods, reported here for completeness, are based on kernels. One of the first approaches is the Fisher kernel [20], proposed for remote homology detection. A different kernel, the mismatch string kernel, is proposed in [21]; it measures the similarity between two amino acid sequences based on shared occurrences of subsequences. In [21] it is shown that string kernels perform similarly to the Fisher kernel at a lower computational cost. A class of new kernels that obtain good performance in predicting protein sub-cellular localization is developed in [22]: a set of kernel functions, derived from k-peptide vectors mapped by a matrix of high-scored pairs of k-peptides measured by BLOSUM62 scores, is used for training a support vector machine. Another interesting approach is the bio-basis function neural network [23]. In this method the sequences are not encoded in a feature space; instead, the distances obtained by sequence alignment are used to train the neural network.
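To make the idea of a kernel based on shared subsequences concrete, the sketch below computes a plain k-spectrum kernel, which is the special case of the mismatch kernel with zero mismatches allowed. It is not the mismatch kernel of [21] itself; the function names and the default k are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every contiguous subsequence of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a = kmer_counts(a, k)
    counts_b = kmer_counts(b, k)
    # A Counter returns 0 for missing keys, so unshared k-mers contribute nothing.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


# "AAAA" holds the 2-mer "AA" three times and "AAA" holds it twice.
print(spectrum_kernel("AAAA", "AAA", k=2))  # 3 * 2 = 6
```

A mismatch kernel extends this by also counting k-mers that differ from each other in at most a fixed number of positions.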
The aim of this paper is to propose a novel descriptor obtained from a matrix representation of peptides/proteins. As in many of the methods cited above, physicochemical properties are used to discriminate among the amino acids: each descriptor, a square matrix whose dimension equals the length of the peptide/protein, is obtained by considering a partial ordering of the amino acids of the peptide/protein according to a given physicochemical property. A more compact representation is obtained by treating this matrix as an image and applying a texture descriptor, which yields a scale-invariant representation independent of the length of the peptide/protein. Several well-known texture descriptors are tested in this paper: local binary pattern (LBP), which extracts a histogram describing the differences between each matrix point and its neighborhood; discrete cosine transform (DCT); and Daubechies wavelet, which performs a multi-resolution analysis of the image.
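The two steps described above, building a square matrix from an ordering of the residues and summarizing it with LBP, can be sketched as follows. The matrix used here (the sign of the pairwise property difference) is only one plausible ordering-based construction and is an assumption, since the exact definition is not given in this section; the hydropathy values, the function names and the basic 8-neighbour LBP variant are likewise illustrative choices.

```python
# Assumption: Kyte-Doolittle hydropathy is used as the example property.
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}

# Offsets of the 8 neighbours, clockwise from the top-left corner.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def order_matrix(seq, prop=HYDROPATHY):
    """Hypothetical L x L matrix: +1, 0 or -1 depending on whether residue i
    has a larger, equal or smaller property value than residue j."""
    vals = [prop[aa] for aa in seq]
    return [[(a > b) - (a < b) for b in vals] for a in vals]


def lbp_histogram(img):
    """Normalized 256-bin histogram of basic 8-neighbour LBP codes,
    computed over the interior points of a 2-D list."""
    n_rows, n_cols = len(img), len(img[0])
    hist = [0.0] * 256
    count = 0
    for r in range(1, n_rows - 1):
        for c in range(1, n_cols - 1):
            code = 0
            for bit, (dr, dc) in enumerate(OFFSETS):
                # Set the bit when the neighbour is at least as large as the centre.
                if img[r + dr][c + dc] >= img[r][c]:
                    code |= 1 << bit
            hist[code] += 1
            count += 1
    return [h / count for h in hist] if count else hist


short = lbp_histogram(order_matrix("SQNYPIVQ"))
longer = lbp_histogram(order_matrix("SQNYPIVQARVL"))
print(len(short), len(longer))  # both 256, regardless of sequence length
```

Because the histogram is normalized by the number of interior points, sequences of different lengths map to vectors of the same size, which is what allows a single classifier to be trained on them. In this sketch a sequence needs at least three residues to have any interior point.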
The experimental section reports several tests on the following datasets: a vaccine dataset for the prediction of peptides that bind human leukocyte antigens, an HIV-1 protease cleavage site prediction dataset, and a membrane protein type dataset. Our results show that the proposed descriptors obtain good classification accuracy and can be considered for fusion with other standard descriptors to further improve classification performance.
The remainder of the paper is organized as follows. Section 2 briefly reviews related work on the three applications tested in this paper; Section 3 introduces the feature extraction method proposed in this work; Section 4 reports experimental results obtained on three different classification problems; finally, Section 5 draws some conclusions.
HIV-1 protease cleavage site prediction
The HIV-1 protease [24], [25], [26] is essential for the replication of the AIDS virus. Protease inhibitors bind the active site of HIV-1 protease and prevent its normal functioning. Several methods for predicting HIV-1 protease cleavage sites in proteins have been published in the literature, most of them based on machine learning systems: in [27], [28], [29] a standard feed-forward multilayer perceptron (MLP) is proposed and reported to outperform the decision tree classifier; in [24]
System description
The system proposed in this work is based on a matrix representation of the peptide/protein, which is treated as an image and characterized using a texture descriptor. An SVM classifier is trained on the extracted texture features to perform the classification task. A schematic of the proposed system is shown in Fig. 1. The following subsections describe the main steps of the approach.
Datasets and protocols
The proposed system has been tested using the following datasets:
- HIV: This dataset [48] is the largest dataset tested to date for the HIV-1 protease problem. It contains 1625 octamer protein sequences: 374 HIV-1 protease cleavable sites and 1251 uncleavable sites. The ten-fold cross-validation testing protocol is used on this dataset.
- Vaccine (VAC): This dataset [34] contains peptides from five HLA-A2 molecules that bind/non-bind multiple HLA. The testing protocol suggested in [34] has been adopted,
Conclusion
In this paper, we have presented a novel method for describing peptides/proteins, based on texture descriptors computed from a matrix representation of the peptide/protein. The method selects a physicochemical property, which is used to construct a representation of the peptide/protein as a matrix (using a method based on the Hasse matrix); this matrix representation is treated as an image, and several texture descriptors are extracted and used for
References (53)
- et al. MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochem Biophys Res Commun (2007).
- Comparison among feature extraction methods for HIV-1 protease cleavage site prediction. Pattern Recognit (2006).
- Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem Biophys Res Commun (2000).
- et al. Low-frequency Fourier spectrum for predicting membrane protein types. Biochem Biophys Res Commun (2005).
- et al. Review: recent progresses in protein subcellular location prediction. Anal Biochem (2007).
- et al. MppS: an ensemble of support vector machine based on multiple physicochemical properties of amino-acids. NeuroComputing (2006).
- A vectorized sequence-coupling model for predicting HIV protease cleavage sites in proteins. J Biol Chem (1993).
- Review: prediction of HIV protease cleavage sites in proteins. Anal Biochem (1996).
- et al. Neural network prediction of the HIV-1 protease cleavage sites. J Theor Biol (1995).
- et al. Artificial neural network model for predicting HIV protease cleavage sites in protein. Adv Eng Soft (1998).