Prediction of protein structure classes by incorporating different protein descriptors into general Chou’s pseudo amino acid composition
Introduction
The three-dimensional structure of a protein and its biological function are related: to know a protein’s structure is to know something about its biological function (Anfinsen, 1973). Since proper identification of a protein’s structure assists researchers in predicting its function, it is no surprise that over the last twenty years the identification of protein structures has become one of the most active areas of research in computational biology, proteomics, and bioinformatics.
Structural features of proteins are typically described at four levels of complexity. The primary structure of a protein is the linear sequence of amino acids, drawn from 20 standard types, that constitutes the polypeptide chain. These amino acids are responsible for many functions in living organisms and provide clues for predicting secondary and tertiary structures and functions from polygenetic sequences. The secondary protein structure refers to the localized organization of parts of the polypeptide chain, which take on specific geometric arrangements: helices (α structures), strands (β sheets), and coils. The tertiary protein structure is the final specific geometric shape that a protein assumes, i.e., the complex and irregular folding of the peptide chain in three dimensions. The quaternary structure describes the interactions between different peptide chains that make up the protein. Based on the work of Levitt and Chothia (1976), four structural classes of proteins are generally identified: all-α, which includes proteins with few strands; all-β, which includes proteins with few helices; α+β, which includes proteins with both helices and strands but with the strands segregated; and α/β, which includes proteins with both helices and strands but with the strands interspersed.
Protein structure prediction is the prediction of a protein’s structure given its primary structure, or amino acid (AA) sequence. Predicting protein structure is an extremely difficult problem and is typically based on manually identifying similarities to the folding patterns of known protein structures. Many large-scale sequencing projects have produced a tremendous amount of data on protein sequences, creating a huge gap between the number of identified sequences and the number of identified protein structures (Rost and Sander, 1996). Automated computational methods capable of fast and accurate prediction of protein structures would not only reduce this gap but also further our understanding of protein heterogeneity, protein–protein interactions, and protein–peptide interactions, which in turn would lead to better diagnostic tools and methods for predicting protein/drug interactions.
Key to the success in developing automated systems for protein structure prediction is the method of protein representation. Many methods of structural protein prediction are based on the simple amino acid composition (AAC), which represents a protein as a 20-dimensional vector corresponding to the frequencies of the 20 AAs in a given protein sequence (Chou, 1995, Nakashima et al., 1986). However, because this method ignores important sequential information and because similar AA sequences share similar folding patterns, AAC representations produce mediocre results. Chou’s pseudo amino acid composition (PseAAC) (Chou and Shen, 2007), one of the most studied methods of primary protein representation, overcomes some of the weaknesses of AAC by retaining additional information regarding a protein’s sequential order along with the first 20 factors representing the components of the conventional AAC (Chou, 2001, Chou, 2009). Because the PseAAC approach (Chou, 2001, Chou, 2005), also known as Chou’s PseAAC (Lin and Lapointe, 2013), has been widely and increasingly used, three powerful open-access software packages have recently been established in addition to the web server ‘PseAAC’ (Shen and Chou, 2008): ‘PseAAC-Builder’ (Du et al., 2012), ‘propy’ (Cao et al., 2013), and ‘PseAAC-General’ (Du et al., 2014). The first two generate various modes of Chou’s special PseAAC, while the third generates those of Chou’s general PseAAC. PseAAC has proven particularly effective in the prediction of protein structure on datasets with high similarity but performs poorly on datasets with low similarity. To overcome this problem, Chou and Cai (2004) have proposed a representation called functional domain composition that is designed to handle low-similarity datasets.
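To make the AAC representation concrete, the following is a minimal Python sketch of the 20-dimensional frequency vector described above. The function name `aac`, the residue ordering, and the choice to skip non-standard symbols are assumptions made for illustration; they are not taken from any of the cited tools.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (this ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Return the 20-dimensional amino acid composition of a sequence.

    Each entry is the relative frequency of one standard amino acid.
    Symbols outside the 20 standard residues are ignored (an assumption).
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Two sequences with the same residues in a different order give the same
# vector, which is the loss of sequential information noted in the text.
same = aac("MKV") == aac("VKM")
```

Because the vector depends only on residue counts, any permutation of a sequence maps to the same point in the 20-dimensional space, which is the limitation that sequence-order descriptors such as PseAAC are meant to address.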
Since proteins within the same class but with low sequence similarity show high similarity in their secondary structural elements, several methods have been proposed that utilize additional representations based on predicted secondary structure (Ding et al., 2012, Kong et al., 2014, Kurgan et al., 2008, Liu and Jia, 2010, Mizianty and Kurgan, 2009, Yang et al., 2010, Zhang et al., 2011). Jones (1999), for instance, uses a secondary structure representation for protein structural prediction that is based on the position specific scoring matrices (PSSM) generated by PSI-BLAST. Although such methods work relatively well for most classes, the α/β and, especially, α+β classes have proven problematic. Moreover, most secondary structural representations focus on the content of secondary elements and ignore the useful information found in the position of secondary elements. The recent descriptors proposed by Kong et al. (2014) have produced promising results with the α/β and α+β classes, and Dai et al. (2013) have proposed a powerful set of secondary features for protein structure classification, which they call position-based features of predicted secondary structural elements (PBF-PSSEs), that takes into consideration the position of secondary elements.
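The difference between content-based and position-aware secondary structure features can be illustrated with a small Python sketch. It assumes a predicted secondary structure string over the three states H (helix), E (strand), and C (coil); the function names and the helix/strand switch count are hypothetical illustrations and are not the PBF-PSSE features of Dai et al. (2013) or the descriptors of Kong et al. (2014).

```python
from itertools import groupby


def sse_content(sse):
    """Fraction of residues in each state: H (helix), E (strand), C (coil)."""
    n = len(sse)
    if n == 0:
        raise ValueError("empty secondary structure string")
    return {state: sse.count(state) / n for state in "HEC"}


def segment_order(sse):
    """Collapse the string into its ordered helix/strand segments, dropping coils."""
    return [state for state, _ in groupby(sse) if state in "HE"]


def helix_strand_switches(sse):
    """Count how often consecutive segments change between helix and strand.

    Hypothetical position-aware feature: interspersed helices and strands
    yield many switches, segregated ones yield few.
    """
    order = segment_order(sse)
    return sum(1 for a, b in zip(order, order[1:]) if a != b)


# Two toy strings with identical content but different segment arrangement.
interspersed = "HHHHCCEEEECCHHHHCCEEEE"
segregated = "HHHHCCHHHHCCEEEECCEEEE"
```

Both toy strings have the same helix, strand, and coil fractions, so a content-only descriptor cannot separate them, whereas the switch count differs. This mirrors why interspersed (α/β) and segregated (α+β) arrangements are hard to distinguish from content alone.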
In addition to approaches based on AAC and PSSM, a large body of research has focused on the physicochemical and biochemical properties of individual amino acids, and representations of proteins have been proposed for predicting protein structural classes that are based on these properties (Bu et al., 1999). Using the physicochemical properties, a protein can be represented by a set of 20 numerical values taken from the amino acid index (AAindex) (Kawashima and Kanehisa, 2000).
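As a rough illustration of a property-based encoding, the sketch below maps each residue to one physicochemical value and summarizes the resulting profile with a fixed number of features. It uses the Kyte-Doolittle hydropathy scale as an example of a 20-value index; the choice of scale, the function names, and the use of the mean plus lagged autocovariances as features are all assumptions made for illustration, not the descriptors used in this work.

```python
# Kyte-Doolittle hydropathy values, used here only as an example of a
# 20-value amino acid index (the choice of scale is an assumption).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_profile(sequence, scale=KYTE_DOOLITTLE):
    """Replace each standard residue by its property value (others are skipped)."""
    return [scale[aa] for aa in sequence.upper() if aa in scale]


def autocovariance(profile, lag):
    """Average product of mean-centred values that are `lag` positions apart."""
    n = len(profile)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(profile)")
    mean = sum(profile) / n
    pairs = ((profile[i] - mean) * (profile[i + lag] - mean) for i in range(n - lag))
    return sum(pairs) / (n - lag)


def property_features(sequence, max_lag=3):
    """Fixed-length summary: mean property value plus autocovariances for lags 1..max_lag."""
    profile = property_profile(sequence)
    mean = sum(profile) / len(profile)
    return [mean] + [autocovariance(profile, k) for k in range(1, max_lag + 1)]
```

The output length depends only on `max_lag`, not on the sequence length, so proteins of different sizes are mapped to vectors of the same dimension.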
Our goal in this work is to develop a system that performs better than previous predictors for protein structure classification. We accomplish this goal by building an ensemble of SVMs, where each SVM is trained using a different protein descriptor based on the following protein representations (all of which are described in detail in Section 2):
- Position specific scoring matrix (PSSM) of proteins;
- Substitution matrix (SM) representation;
- Secondary structure elements (SSE) sequence.
In Section 2, we propose new descriptors based on SM and SSE. We test our ensemble of descriptors on three large datasets that are well-established in the literature. As reported in Section 3, our system significantly outperforms previous state-of-the-art approaches.
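The idea of combining classifiers trained on different descriptors can be sketched as follows. This excerpt does not state how the SVM outputs are fused, so the sum rule with min-max normalization used here is an assumption; training the SVMs is omitted, and the per-class scores are made-up numbers for illustration only.

```python
# The four structural classes named in the text.
CLASSES = ["all-α", "all-β", "α+β", "α/β"]


def normalize(scores):
    """Min-max scale one classifier's scores to [0, 1] so classifiers are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def sum_rule(score_sets):
    """Fuse per-descriptor class scores by summing normalized scores (assumed rule).

    `score_sets` maps a descriptor name to one score per class, in CLASSES order.
    Returns the predicted class label and the fused score vector.
    """
    fused = [0.0] * len(CLASSES)
    for name, scores in score_sets.items():
        if len(scores) != len(CLASSES):
            raise ValueError(f"{name}: expected {len(CLASSES)} scores")
        for i, s in enumerate(normalize(scores)):
            fused[i] += s
    best = max(range(len(CLASSES)), key=fused.__getitem__)
    return CLASSES[best], fused


# Hypothetical decision scores from three SVMs, one per representation.
example_scores = {
    "PSSM": [0.2, 1.1, 0.4, -0.3],
    "SM": [0.9, 0.8, 0.1, -0.5],
    "SSE": [-0.2, 1.4, 0.3, 0.0],
}
label, fused = sum_rule(example_scores)
```

In this toy case the SM-based scores alone would favor all-α, while the fused scores favor all-β, which shows how an ensemble can override a single descriptor's decision.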
Pattern representation and feature extraction
In this study we apply a machine learning approach to the protein structure classification problem. The goal is to find an effective yet compact representation of proteins that is based on a fixed-length encoding scheme that can be coupled with a general purpose classifier, an approach that has been applied successfully to other biological problems, e.g., subcellular localization and protein–protein interactions (Chou and Shen, 2007, Nanni et al., 2010).
In this paper we investigate different
Benchmark datasets
The proposed approach is evaluated on three widely used low-similarity benchmark datasets: FC699 (Kurgan et al., 2008), 1189 (Wang and Yuan, 2000), and 640 (Yang et al., 2010). Table 1 presents, for each dataset, the number of samples in each of the four structural classes: all-α, all-β, α+β, and α/β.
As suggested in Kong et al. (2014), the 25PDB dataset (Kurgan and Homaeian, 2006) is used as the training set, while the other three datasets (FC699, 1189, and 640) are used as test sets (independent
Conclusion
In this work we present a system for predicting protein structure classes using different protein representations and descriptors for training an ensemble of SVMs. The features that describe a given protein are obtained using representations based on PSSM, SM, and SSE. We present an empirical study where different feature extraction methods for representing proteins are compared and combined. Moreover, novel configurations are proposed and evaluated. The best performance is obtained when the
References (51)
Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review). J. Theor. Biol. (2011)
Predicting protein structural class by functional domain composition. Biochem. Biophys. Res. Commun. (2004)
A novel protein structural classes prediction method based on predicted secondary structure. Biochimie (2012)
PseAAC-builder: a cross-platform stand-alone program for generating various special Chou’s pseudo-amino acid compositions. Anal. Biochem. (2012)
Protein secondary structure prediction based on position specific scoring matrices. J. Mol. Biol. (1999)
Accurate prediction of protein structural classes by incorporating predicted secondary structure information into the general form of Chou’s pseudo amino acid composition. J. Theor. Biol. (2014)
Prediction of structural classes for protein sequences and domains: impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy. Pattern Recognit. (2006)
A high-accuracy protein structural class prediction algorithm using predicted secondary structural information. J. Theor. Biol. (2010)
A high performance set of PseAAC descriptors extracted from the amino acid sequence for protein classification. J. Theor. Biol. (2010)
PseAAC: a flexible web-server for generating various kinds of protein pseudo amino acid composition. Anal. Biochem. (2008)
A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. J. Theor. Biol.
iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal. Biochem.
High-accuracy prediction of protein structural class for low-similarity sequences based on predicted secondary structure. Biochimie
Principles that govern the folding of protein chains. Science
A new representation for protein secondary structure prediction based on frequent patterns. Bioinformatics
Prediction of protein (domain) structural classes based on amino-acid index. Eur. J. Biochem.
Propy: a tool to generate various modes of Chou’s PseAAC. Bioinformatics
Predicting Anatomical Therapeutic Chemical (ATC) classification of drugs by integrating chemical–chemical interactions and similarities. PLoS One
A novel approach to predicting protein structural classes in a (20-1)-d amino acid composition space. Proteins
Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Struct., Funct. Genet.
Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics
Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Curr. Proteom.
Some remarks on predicting multi-label attributes in molecular biosystems. Mol. Biosyst.
Review: recent progresses in protein subcellular location prediction. Anal. Biochem.