Elsevier

Journal of Theoretical Biology

Volume 360, 7 November 2014, Pages 109-116
Prediction of protein structure classes by incorporating different protein descriptors into general Chou’s pseudo amino acid composition

https://doi.org/10.1016/j.jtbi.2014.07.003

Highlights

  • Protein structure identification.

  • Ensemble of several protein descriptors.

  • Descriptors extracted from different protein representations.

  • Support vector machines.

Abstract

Successful protein structure identification enables researchers to estimate the biological functions of proteins, yet it remains a challenging problem. The most common method for determining an unknown protein’s structural class is to perform expensive and time-consuming manual experiments. Because of the availability of amino acid sequences generated in the post-genomic age, it is possible to predict an unknown protein’s structural class using machine learning methods given a protein’s amino-acid sequence and/or its secondary structural elements. Following recent research in this area, we propose a new machine learning system that is based on combining several protein descriptors extracted from different protein representations, such as position specific scoring matrix (PSSM), the amino-acid sequence, and secondary structural sequences. The prediction engine of our system is operated by an ensemble of support vector machines (SVMs), where each SVM is trained on a different descriptor. The results of each SVM are combined by sum rule. Our final ensemble produces a success rate that is substantially better than previously reported results on three well-established datasets. The MATLAB code and datasets used in our experiments are freely available for future comparison at http://www.dei.unipd.it/node/2357.

Introduction

The three-dimensional structure of a protein and its biological function are related: to know a protein’s structure is to know something about its biological function (Anfinsen, 1973). Since proper identification of a protein’s structure assists researchers in predicting its function, it is no surprise that over the last twenty years the identification of protein structures has become one of the most active areas of research in computational biology, proteomics, and bioinformatics.

Structural features of proteins are typically described at four levels of complexity. The primary structure of a protein is essentially a polymer of 20 amino acids that constitutes the polypeptide chain. These amino acids are responsible for many functions in living organisms and provide clues for predicting secondary and tertiary structures and functions from polygenetic sequences. The secondary protein structure refers to the localized organization of parts of the polypeptide chain, which take on specific geometric arrangements: helices (α structures), strands (β sheets), and coils. The tertiary protein structure is the final specific geometric shape that a protein assumes, i.e., the complex and irregular folding of the peptide chain in three dimensions. The quaternary structure describes the interactions between different peptide chains that make up the protein. Based on the work of Levitt and Chothia (1976), four structural classes of proteins are generally identified: all-α, which includes proteins with few strands; all-β, which includes proteins with few helices; α+β, which includes proteins with both helices and strands but with the strands segregated; and α/β, which includes proteins with both helices and strands but with the strands interspersed.

Protein structure prediction is the prediction of a protein’s structure given its primary structure, or AA sequence. Predicting protein structure is an extremely difficult problem and is typically based on manually identifying the similarity of the folding patterns of already existing protein structures. Many large-scale sequencing projects have produced a tremendous amount of data on protein sequences, creating a huge gap between the number of identified sequences and the number of identified protein structures (Rost and Sander, 1996). Automated computational methods capable of fast and accurate prediction of protein structures would not only reduce this gap but also further our understanding of protein heterogeneity, protein–protein interactions, and protein–peptide interactions, which in turn would lead to better diagnostic tools and methods for predicting protein/drug interactions.

Key to the success in developing automated systems for protein structure prediction is the method of protein representation. Many methods of structural protein prediction are based on the simple amino acid composition (AAC), which represents a protein as a 20-dimensional vector corresponding to the frequencies of the 20 AAs in a given protein sequence (Chou, 1995, Nakashima et al., 1986). However, because this method ignores important sequential information and because similar AA sequences share similar folding patterns, AAC representations produce mediocre results. Chou’s pseudo amino acid composition (PseAAC) (Chou and Shen, 2007), one of the most studied methods of primary protein representation, overcomes some of the weaknesses of AAC by retaining additional information regarding a protein’s sequential order along with the first 20 factors representing the components of the conventional AAC (Chou, 2001, Chou, 2009). Because the PseAAC approach (Chou, 2001, Chou, 2005), also referred to as Chou’s PseAAC (Lin and Lapointe, 2013), has been widely and increasingly used, the web server ‘PseAAC’ (Shen and Chou, 2008), built in 2008, has recently been joined by three powerful open-access software packages: ‘PseAAC-Builder’ (Du et al., 2012), ‘propy’ (Cao et al., 2013), and ‘PseAAC-General’ (Du et al., 2014). The first two generate various modes of Chou’s special PseAAC, while the third generates those of Chou’s general PseAAC. PseAAC has proven particularly effective in the prediction of protein structure on datasets with high similarity but performs poorly on datasets with low similarity. To overcome this problem, Chou and Cai (2004) have proposed a representation called functional domain composition that is designed to handle low-similarity datasets.
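The AAC representation described above can be illustrated with a minimal Python sketch. The function name, the residue ordering, and the toy sequence are assumptions made for illustration only; they are not taken from the MATLAB code released with the paper.

```python
from collections import Counter

# The 20 standard residues in one-letter code (this ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return the 20-dimensional AAC vector of relative residue frequencies."""
    seq = sequence.upper()
    # Count only standard residues; ambiguous codes such as X are skipped.
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Toy sequence, not a real protein.
aac = amino_acid_composition("MKVLAAGLLK")

# A permutation of the same residues yields exactly the same vector,
# which is the loss of sequential information noted in the text.
shuffled = amino_acid_composition("KLLGAALVKM")
```

Because the vector depends only on residue counts, any reordering of the sequence gives an identical descriptor; this is the limitation that PseAAC is meant to address by adding sequence-order terms.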

Since proteins within the same class but with low sequence similarity show high similarity in their secondary structural elements, several methods have been proposed which utilize additional secondary structural prediction representations (Ding et al., 2012, Kong et al., 2014, Kurgan et al., 2008, Liu and Jia, 2010, Mizianty and Kurgan, 2009, Yang et al., 2010, Zhang et al., 2011). Jones (1999), for instance, uses a secondary structure representation for protein structural prediction that is based on the position specific scoring matrices (PSSM) generated by PSI-BLAST. Although such methods work relatively well for most classes, the α/β and, especially, α+β classes have proven problematic. Moreover, most secondary structural representations focus on the content of secondary elements and ignore the useful information found in the position of secondary elements. The recent descriptors proposed by Kong et al. (2014) have produced promising results with the α/β and α+β classes, and Dai et al. (2013) have proposed a powerful set of secondary features for protein structure classification, which they call position-based features of predicted secondary structural elements (PBF-PSSEs), that takes into consideration the position of secondary elements.

In addition to approaches based on AAC and PSSM, a large body of research has focused on the physicochemical and biochemical properties of individual amino acids, and representations of proteins have been proposed for predicting protein structural classes that are based on these properties (Bu et al., 1999). Using the physicochemical properties, a protein can be represented by a set of 20 numerical values taken from the amino acid index (AAindex) (Kawashima and Kanehisa, 2000).
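As a rough illustration of a property-based encoding, the sketch below maps each residue to one numerical value per amino acid. It assumes the Kyte-Doolittle hydropathy scale as the example property; which indices the authors actually use is not stated in this excerpt, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values, one per standard amino acid.
# Chosen only as an example property; the indices used in the paper
# are not specified in this excerpt.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def encode_by_property(sequence, index=HYDROPATHY):
    """Replace every residue with its value in the chosen property index."""
    try:
        return [index[ch] for ch in sequence.upper()]
    except KeyError as err:
        raise ValueError(f"no index value for residue {err.args[0]!r}") from None

def mean_property(sequence, index=HYDROPATHY):
    """Average property value over the sequence (a single global feature)."""
    values = encode_by_property(sequence, index)
    if not values:
        raise ValueError("empty sequence")
    return sum(values) / len(values)

profile = encode_by_property("MKVLAAGLLK")  # toy sequence, not a real protein
```

The resulting numerical profile has the same length as the sequence, so a fixed-length descriptor still has to be derived from it (for example, by averaging as in `mean_property`, or by other summary statistics).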

Our goal in this work is to develop a system that performs better than previous predictors for protein structure classification. We accomplish this goal by building an ensemble of SVMs, where each SVM is trained using a different protein descriptor based on the following protein representations (all of which are described in detail in Section 2):

  • Position specific scoring matrix (PSSM) of proteins;

  • Substitution matrix (SM) representation;

  • Secondary structure elements (SSE) sequence.
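The sum-rule fusion of an ensemble of this kind can be sketched as follows. This is a minimal illustration, not the authors' MATLAB implementation: the per-descriptor scores are made-up numbers standing in for SVM outputs, and they are assumed to be already normalized to a comparable range across classifiers.

```python
# The four structural classes, in a fixed (assumed) order.
CLASSES = ["all-alpha", "all-beta", "alpha+beta", "alpha/beta"]

def sum_rule(score_lists):
    """Fuse per-classifier score vectors by summing them class by class.

    Returns the winning class label and the fused score vector.
    """
    fused = [0.0] * len(CLASSES)
    for scores in score_lists:
        if len(scores) != len(CLASSES):
            raise ValueError("each classifier must score every class")
        for i, s in enumerate(scores):
            fused[i] += s
    best = max(range(len(CLASSES)), key=lambda i: fused[i])
    return CLASSES[best], fused

# Hypothetical scores for one test protein, one vector per descriptor.
scores_by_descriptor = {
    "PSSM": [0.10, 0.20, 0.40, 0.30],
    "SM":   [0.05, 0.15, 0.35, 0.45],
    "SSE":  [0.10, 0.10, 0.30, 0.50],
}

label, fused = sum_rule(scores_by_descriptor.values())
```

In this toy case the PSSM-based classifier alone would favor α+β, but the summed scores favor α/β, which shows how the sum rule lets classifiers trained on different representations correct one another.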

In Section 2, we propose new descriptors based on SM and SSE. We test our ensemble of descriptors on three large datasets that are well-established in the literature. As reported in Section 3, our system significantly outperforms previous state-of-the-art approaches.

Pattern representation and feature extraction

In this study we apply a machine learning approach to the protein structure classification problem. The goal is to find an effective yet compact representation of proteins that is based on a fixed-length encoding scheme that can be coupled with a general purpose classifier, an approach that has been applied successfully to other biological problems, e.g., subcellular localization and protein–protein interactions (Chou and Shen, 2007, Nanni et al., 2010).

In this paper we investigate different

Benchmark datasets

The proposed system is evaluated on three widely used low-similarity benchmark datasets: FC699 (Kurgan et al., 2008), 1189 (Wang and Yuan, 2000), and 640 (Yang et al., 2010). Table 1 presents, for each dataset, the number of samples in each of the four structural classes: all-α, all-β, α+β, and α/β.

As suggested in Kong et al. (2014) the 25PDB dataset (Kurgan and Homaeian, 2006) is used as training set, while the other three datasets FC699, 1189 and 640 are used as test sets (independent

Conclusion

In this work we present a system for predicting protein structure classes using different protein representations and descriptors for training an ensemble of SVMs. The features that describe a given protein are obtained using representations based on PSSM, SM, and SSE. We present an empirical study where different feature extraction methods for representing proteins are compared and combined. Moreover, novel configurations are proposed and evaluated. The best performance is obtained when the

References (51)

  • A. Sharma

    A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition

    J. Theor. Biol.

    (2013)
  • X. Xiao et al.

    iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types

    Anal. Biochem.

    (2013)
  • S. Zhang et al.

    High-accuracy prediction of protein structural class for low-similarity sequences based on predicted secondary structure

    Biochimie

    (2011)
  • C. Anfinsen

    Principles that govern the folding of protein chains

    Science

    (1973)
  • F. Birzele et al.

    A new representation for protein secondary structure prediction based on frequent patterns

    Bioinformatics

    (2006)
  • W.S. Bu

    Prediction of protein (domain) structural classes based on amino-acid index

    Eur. J. Biochem.

    (1999)
  • D.S. Cao et al.

    Propy: a tool to generate various modes of Chou’s PseAAC

    Bioinformatics

    (2013)
  • L. Chen et al.

    Predicting Anatomical Therapeutic Chemical (ATC) classification of drugs by integrating chemical–chemical interactions and similarities

    PLoS One

    (2012)
  • Chen W, Feng PM, et al. (2013) iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition,...
  • K.-C. Chou

    A novel approach to predicting protein structural classes in a (20-1)-d amino acid composition space

    Proteins

    (1995)
  • K.-C. Chou

    Prediction of protein cellular attributes using pseudo-amino acid composition

    Proteins: Struct., Funct. Genet.

    (2001)
  • K.-C. Chou

    Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes

    Bioinformatics

    (2005)
  • K.-C. Chou

    Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology

    Curr. Proteom.

    (2009)
  • K.-C. Chou

    Some remarks on predicting multi-label attributes in molecular biosystems

    Mol. Biosyst.

    (2013)
  • K.-C. Chou et al.

    Review: recent progresses in protein subcellular location prediction

    Anal. Biochem.

    (2007)