Coding of amino acids by texture descriptors

https://doi.org/10.1016/j.artmed.2009.10.001

Abstract

Objective

In this paper we propose a new feature extractor for peptide/protein classification based on the calculation of texture descriptors. Representing a peptide/protein with a matrix descriptor, instead of a vector, makes it possible to treat the peptide/protein as an image and to use texture descriptors to represent it.

Methods and materials

A matrix descriptor, which is a square matrix whose dimension equals the length of the peptide/protein, is obtained from a partial ordering of the amino acids of the peptide/protein according to their value of a given physicochemical property. Each matrix descriptor is treated as a texture image, and several texture descriptors are considered in order to obtain a compact representation that is scale invariant (i.e. independent of the length of the peptide/protein). The texture descriptors tested in this work are local binary patterns (LBP), discrete cosine transform (DCT) and Daubechies wavelets.

Results and conclusion

The experimental section reports several tests, aimed at supporting our ideas, performed on the following datasets: a vaccine dataset for the prediction of peptides that bind human leukocyte antigens; a human immunodeficiency virus (HIV-1) protease cleavage site prediction dataset; and a membrane protein type dataset.

The experimental results confirm the usefulness of the novel descriptors: the performance obtained by our system on the three difficult datasets is quite high, indicating that the proposed method is a feasible system for extracting information from peptides and proteins. The performance obtained by each of the three texture descriptors calculated from the matrix-based representation, and coupled with a support vector machine classifier, is lower than the performance obtained by other vector-based descriptors based on physicochemical properties proposed in the literature. Nevertheless, the new descriptors carry different information, and our tests show that the texture descriptors and the vector-based descriptors can be combined to improve the overall performance of the system.

In particular, the proposed approach improves the state-of-the-art results in two out of the three tested problems (the HIV-1 protease cleavage site prediction dataset and the membrane protein type dataset).

Introduction

Several applications need to extract features from peptides/proteins in order to solve a given classification problem [1]. Examples include sub-cellular localization [2], protein–protein interactions [3] and HIV-1 protease cleavage site prediction [4], [5].

Probably the most widely used feature extractor for peptides and proteins is Chou's pseudo amino acid (PseAA) composition [6]. In the literature, several variants of these descriptors have been proposed: hydropathy scales [7], [8], physicochemical distance [9], digital code [10], complexity factor [11], [12], digital signal [13], Fourier low-frequency spectrum [14], cellular automata [15], and “artificial” features created by genetic programming combining one or more “original” Chou's pseudo amino acid features [16]. The interested reader can refer to [17] and [18] for a survey of the different methods for extracting features from peptides and proteins.

Most of the feature extractors proposed in the literature are based on a vectorial representation of the peptide/protein. For example, in [19] a physicochemical encoding is proposed: each amino acid is represented by a 20-dimensional vector with all values set to zero except for the one corresponding to the considered amino acid, which takes the value of the measured physicochemical property. The descriptor associated with a peptide/protein is obtained by concatenating all the 20-dimensional vectors corresponding to its amino acid sequence.

Other interesting encoding methods, reported here for completeness, are based on kernels. One of the first approaches is the Fisher kernel [20], proposed for remote homology detection. A different kernel, the mismatch string kernel, is proposed in [21]; it measures the similarity between two sequences of amino acids based on shared occurrences of subsequences. In [21] it is shown that string kernels achieve performance similar to the Fisher kernel at a lower computational cost. A class of new kernels that obtains good performance in predicting protein sub-cellular localization is developed in [22]: a set of kernel functions derived from k-peptide vectors, mapped by a matrix of high-scored pairs of k-peptides measured by BLOSUM62 scores, is used for training a support vector machine. Another interesting approach is the bio-basis function neural network [23]. In this method the sequences are not encoded in a feature space; instead, the distances obtained by sequence alignment are used to train the neural network.
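To make the idea of "shared occurrences of subsequences" concrete, the following is a naive sketch of a (k, m)-mismatch kernel: each k-mer of a sequence contributes to every k-mer within Hamming distance m, and the kernel is the inner product of the resulting count vectors. It is written for readability, not efficiency, and is not claimed to match the implementation in [21]; the function names and default parameters are assumptions.

```python
from collections import Counter
from itertools import combinations, product

ALPHABET = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard amino acids


def neighbours(kmer, m):
    """All k-mers over ALPHABET within Hamming distance m of `kmer`."""
    found = {kmer}
    for n_sub in range(1, m + 1):
        for positions in combinations(range(len(kmer)), n_sub):
            for letters in product(ALPHABET, repeat=n_sub):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                found.add("".join(chars))
    return found


def mismatch_features(seq, k, m):
    """Sparse feature map: every k-mer of `seq` adds 1 to each neighbour."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        feats.update(neighbours(seq[i:i + k], m))
    return feats


def mismatch_kernel(a, b, k=3, m=1):
    """Inner product of the two mismatch feature maps."""
    fa = mismatch_features(a, k, m)
    fb = mismatch_features(b, k, m)
    return sum(count * fb[kmer] for kmer, count in fa.items())
```

With m = 0 this reduces to counting exactly shared k-mers; allowing mismatches lets similar but non-identical subsequences contribute to the similarity.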

The aim of this paper is to propose a novel descriptor obtained from a matrix representation of peptides/proteins. As in many of the methods cited above, physicochemical properties are used to discriminate among the amino acids: each descriptor, which is a square matrix whose dimension equals the length of the peptide/protein, is obtained from a partial ordering of the amino acids of the peptide/protein according to a given physicochemical property. A more compact representation of this matrix descriptor is obtained by treating the matrix as an image and using a texture descriptor to obtain a scale-invariant representation, independent of the length of the peptide/protein. Several well-known texture descriptors are tested in this paper: local binary patterns (LBP), which extract a histogram that describes the difference between each matrix point and its neighborhood; discrete cosine transform (DCT); and Daubechies wavelets, which perform a multi-resolution analysis of the image.
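One plausible reading of the matrix descriptor is sketched below: entry (i, j) records whether residue i ranks above, below or level with residue j under the chosen property. The exact construction is not given in this excerpt, so the sign-based rule, the use of Kyte-Doolittle hydropathy as the property, and the name `matrix_descriptor` are all assumptions made for illustration.

```python
# Assumed example property: Kyte-Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def matrix_descriptor(peptide, prop=KYTE_DOOLITTLE):
    """Square n x n matrix for a peptide of length n (hypothetical rule):
    +1 if residue i has a larger property value than residue j,
    -1 if smaller, 0 if the two values are equal."""
    values = [prop[residue] for residue in peptide]

    def sign(x):
        return (x > 0) - (x < 0)

    return [[sign(vi - vj) for vj in values] for vi in values]


matrix = matrix_descriptor("SQNYPIVQ")  # arbitrary example octamer
```

The matrix side equals the sequence length, so sequences of different lengths give images of different sizes; this is what motivates a size-independent texture summary.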

The experimental section reports several tests on the following datasets: a vaccine dataset for the prediction of peptides that bind human leukocyte antigens; an HIV-1 protease cleavage site prediction dataset; and a membrane protein type dataset. Our results show that the proposed descriptors achieve good classification accuracy and can be considered for fusion with other standard descriptors to further improve classification performance.

The remainder of the paper is organized as follows. Section 2 briefly reviews related work on the three applications tested in this paper; Section 3 introduces the feature extraction method proposed in this work; Section 4 reports the experimental results obtained on three different classification problems; finally, Section 5 draws some conclusions.

Section snippets

HIV-1 protease cleavage site prediction

The HIV-1 protease [24], [25], [26] is essential for the replication of the AIDS virus. Protease inhibitors bind the active site of the HIV-1 protease and prevent the protease from functioning normally. Several methods for predicting HIV-1 protease cleavage sites in proteins have been published in the literature, most of them based on machine learning systems: in [27], [28], [29] a standard feed-forward multilayer perceptron (MLP) is proposed to outperform the decision tree classifier; in [24]

System description

The system proposed in this work is based on a matrix representation of the peptide/protein, which is treated as an image and characterized using a texture descriptor. A support vector machine (SVM) classifier is trained on the extracted texture features to perform the classification task. A graphical schema of the proposed system is reported in Fig. 1. The following subsections describe the main steps of the approach.
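The texture step can be illustrated with a basic 8-neighbour local binary pattern histogram computed over a matrix given as a list of rows. This is a sketch under stated assumptions rather than the paper's exact variant: it uses the plain 3 x 3 LBP with a `>=` threshold, skips border cells, and normalises the 256-bin histogram so that the feature length and scale do not depend on the matrix size. The name `lbp_histogram` is hypothetical.

```python
def lbp_histogram(image):
    """Normalised 256-bin histogram of basic 3x3 LBP codes.

    `image` is a list of equal-length rows of numbers. Only interior cells
    (those with all eight neighbours) are coded.
    """
    # Neighbours visited clockwise, starting from the top-left cell.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    rows, cols = len(image), len(image[0])
    hist = [0] * 256
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            centre = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least the centre value.
                if image[r + dr][c + dc] >= centre:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist)
    if total == 0:  # matrix too small to have interior cells
        return [0.0] * 256
    return [count / total for count in hist]


flat = [[0] * 5 for _ in range(5)]          # constant 5x5 "image"
peak = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]    # single bright centre cell
flat_hist = lbp_histogram(flat)
peak_hist = lbp_histogram(peak)
```

Because every input yields a 256-value vector, matrices built from sequences of different lengths become directly comparable and can be passed to a standard classifier such as an SVM.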

Datasets and protocols

The proposed system has been tested using the following datasets:

  • HIV: This dataset [48] is the largest dataset ever tested for the HIV-1 protease problem. It contains 1625 octamer protein sequences: 374 HIV-1 protease cleavable sites and 1251 uncleavable sites. The ten-fold cross-validation testing protocol is used on this dataset.

  • Vaccine (VAC): This dataset [34] contains peptides from five HLA-A2 molecules that bind or do not bind multiple HLA. The testing protocol suggested in [34] has been adopted,
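As an illustration of the ten-fold cross-validation protocol mentioned for the HIV dataset, the sketch below splits 1625 labelled samples (374 positive, 1251 negative) into ten folds. Stratifying by class and the fixed random seed are assumptions of this sketch; the excerpt states only that ten-fold cross-validation is used. The function name `stratified_folds` is hypothetical.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, keeping class proportions
    roughly equal across folds (assumed; not stated in the source)."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Class sizes taken from the HIV dataset description: 374 cleavable, 1251 not.
labels = [1] * 374 + [0] * 1251
folds = stratified_folds(labels)

for test_fold in folds:
    held_out = set(test_fold)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # A classifier would be trained on train_idx and evaluated on test_fold.
```

Each sample is held out exactly once, and the reported performance is typically the average over the ten test folds.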

Conclusion

In this paper, we have presented a novel method for describing the peptides/proteins, based on the calculation of texture descriptors from a matrix representation of the peptides/proteins. The novel method is based on the selection of a physicochemical property which is used to construct a representation of the peptide/protein as a matrix (using a method based on the Hasse matrix); this matrix representation is considered as an image and several texture descriptors are extracted and used for

References (53)

  • H.B. Shen et al.

    HIVcleave: a web-server for predicting HIV protease cleavage sites in proteins

    Anal Biochem

    (2008)
  • V. Brusic et al.

    Computational methods for prediction of T-cell epitopes: a framework for modelling, testing, and applications

    Methods

    (2004)
  • X. Pu et al.

    Prediction of membrane protein types from sequences and position-specific scoring matrices

    J Theor Biol

    (2007)
  • P. Pudil et al.

    Floating search methods in feature selection

    Pattern Recognit Lett

    (1994)
  • L.I. Kuncheva

    Diversity in multiple classifier systems

    Inform Fusion

    (2005)
  • K.C. Chou et al.

    Review: prediction of protein structural classes

    Crit Rev Biochem Mol Biol

    (1995)
  • L. Nanni et al.

    An ensemble of k-local hyperplane for predicting protein–protein interactions

    Bioinformatics

    (2006)
  • L. Nanni et al.

    A genetic approach for building different alphabets for peptide and protein classification

    BMC Bioinform

    (2008)
  • K.C. Chou et al.

    Predicting protein–protein interactions from sequences in a hybridization space

    J Proteome Res

    (2006)
  • M. Wang et al.

    Weighted-support vector machines for predicting membrane protein types based on pseudo amino acid composition

    Protein Eng Design Select

    (2004)
  • K.C. Chou

    Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes

    Bioinformatics

    (2005)
  • Y. Gao et al.

    Using pseudo amino acid composition to predict protein subcellular location: approached with Lyapunov index, Bessel function, and Chebyshev filter

    Amino Acids

    (2005)
  • X. Xiao et al.

    Using complexity measure factor to predict protein subcellular location

    Amino Acids

    (2005)
  • X. Xiao et al.

    Using pseudo amino acid composition to predict protein structural classes: approached with complexity measure factor

    J Comput Chem

    (2006)
  • X. Xiao et al.

    Digital coding of amino acids based on hydrophobic index

    Protein Peptide Lett

    (2007)
  • X. Xiao et al.

    Using cellular automata images and pseudo amino acid composition to predict protein subcellular location

    Amino Acids

    (2006)