Prediction of protein structure classes by incorporating different protein descriptors into general Chou’s pseudo amino acid composition

doi:10.1016/j.jtbi.2014.07.003

Journal of Theoretical Biology

Volume 360, 7 November 2014, Pages 109-116

https://doi.org/10.1016/j.jtbi.2014.07.003

Highlights

• Protein structure identification.
• Ensemble of several protein descriptors.
• Descriptors extracted from different protein representations.
• Support vector machines.

Abstract

Successful protein structure identification enables researchers to estimate the biological functions of proteins, yet it remains a challenging problem. The most common method for determining an unknown protein’s structural class is to perform expensive and time-consuming manual experiments. Because of the availability of amino acid sequences generated in the post-genomic age, it is possible to predict an unknown protein’s structural class using machine learning methods, given the protein’s amino acid sequence and/or its secondary structural elements. Following recent research in this area, we propose a new machine learning system that combines several protein descriptors extracted from different protein representations, such as the position specific scoring matrix (PSSM), the amino acid sequence, and secondary structural sequences. The prediction engine of our system is an ensemble of support vector machines (SVMs), where each SVM is trained on a different descriptor. The results of the SVMs are combined by the sum rule. Our final ensemble produces a success rate that is substantially better than previously reported results on three well-established datasets. The MATLAB code and datasets used in our experiments are freely available for future comparison at http://www.dei.unipd.it/node/2357.

Introduction

The three-dimensional structure of a protein and its biological function are related: to know a protein’s structure is to know something about its biological function (Anfinsen, 1973). Since proper identification of a protein’s structure assists researchers in predicting its function, it is no surprise that over the last twenty years the identification of protein structures has become one of the most active areas of research in computational biology, proteomics, and bioinformatics.

Structural features of proteins are typically described at four levels of complexity. The primary structure of a protein is essentially a polymer of 20 amino acids that constitutes the polypeptide chain. These amino acids are responsible for many functions in living organisms and provide clues for predicting secondary and tertiary structures and functions from polygenetic sequences. The secondary protein structure refers to the localized organization of parts of the polypeptide chain, which take on specific geometric arrangements: helices (α structures), strands (β sheets), and coils. The tertiary protein structure is the final specific geometric shape that a protein assumes, i.e., the complex and irregular folding of the peptide chain in three dimensions. The quaternary structure describes the interactions between the different peptide chains that make up the protein. Based on the work of Levitt and Chothia (1976), four structural classes of proteins are generally identified: all-α, which includes proteins with few strands; all-β, which includes proteins with few helices; α+β, which includes proteins with both helices and strands but with the strands segregated; and α/β, which includes proteins with both helices and strands but with the strands interspersed.

Protein structure prediction is the prediction of a protein’s structure given its primary structure, i.e., its amino acid (AA) sequence. Predicting protein structure is an extremely difficult problem and is typically based on manually identifying similarities with the folding patterns of already known protein structures. Many large-scale sequencing projects have produced a tremendous amount of data on protein sequences, creating a huge gap between the number of identified sequences and the number of identified protein structures (Rost and Sander, 1996). Automated computational methods capable of fast and accurate prediction of protein structures would not only reduce this gap but also further our understanding of protein heterogeneity, protein–protein interactions, and protein–peptide interactions, which in turn would lead to better diagnostic tools and methods for predicting protein/drug interactions.

Key to the success in developing automated systems for protein structure prediction is the method of protein representation. Many methods of protein structural prediction are based on the simple amino acid composition (AAC), which represents a protein as a 20-dimensional vector corresponding to the frequencies of the 20 AAs in a given protein sequence (Chou, 1995; Nakashima et al., 1986). However, because this method ignores important sequential information and because similar AA sequences share similar folding patterns, AAC representations produce mediocre results. Chou’s pseudo amino acid composition (PseAAC) (Chou and Shen, 2007), one of the most studied methods of primary protein representation, overcomes some of the weaknesses of AAC by retaining additional information regarding a protein’s sequential order along with the first 20 factors representing the components of the conventional AAC (Chou, 2001; Chou, 2009). Because PseAAC (Chou, 2001; Chou, 2005), also called Chou’s PseAAC (Lin and Lapointe, 2013), has been widely and increasingly used, three open-access software packages have recently been established in addition to the web server ‘PseAAC’ (Shen and Chou, 2008): ‘PseAAC-Builder’ (Du et al., 2012), ‘propy’ (Cao et al., 2013), and ‘PseAAC-General’ (Du et al., 2014). The first two generate various modes of Chou’s special PseAAC, while the third generates those of Chou’s general PseAAC. PseAAC has proven particularly effective in the prediction of protein structure on datasets with high similarity but performs poorly on datasets with low similarity. To overcome this problem, Chou and Cai (2004) have proposed a representation called functional domain composition that is designed to handle low-similarity datasets.
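As a concrete illustration of the AAC representation described above, the following minimal Python sketch computes the 20-dimensional frequency vector for a sequence. The function name and the alphabetical ordering of the residues are assumptions made for illustration; they are not taken from the authors' MATLAB code.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code; the ordering here is an
# arbitrary (alphabetical) choice made for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20-dimensional AAC vector (residue frequencies) of a sequence.

    Characters outside the 20 standard amino acids are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

Because the vector only counts residues, `amino_acid_composition("AACD")` and `amino_acid_composition("DCAA")` are identical. This is the loss of sequence-order information that PseAAC is intended to reduce.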

Because proteins within the same class but with low sequence similarity show high similarity in their secondary structural elements, several methods have been proposed that use additional representations based on predicted secondary structure (Ding et al., 2012; Kong et al., 2014; Kurgan et al., 2008; Liu and Jia, 2010; Mizianty and Kurgan, 2009; Yang et al., 2010; Zhang et al., 2011). Jones (1999), for instance, uses a secondary structure representation for protein structural prediction that is based on the position specific scoring matrices (PSSM) generated by PSI-BLAST. Although such methods work relatively well for most classes, the α/β and, especially, the α+β classes have proven problematic. Moreover, most secondary structural representations focus on the content of secondary elements and ignore the useful information found in the position of secondary elements. The recent descriptors proposed by Kong et al. (2014) have produced promising results with the α/β and α+β classes, and Dai et al. (2013) have proposed a powerful set of secondary features for protein structure classification, which they call position-based features of predicted secondary structural elements (PBF-PSSEs), that takes into consideration the position of secondary elements.
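The difference between content-based and position-aware secondary structure features can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a predicted secondary structure string over the three states H (helix), E (strand), and C (coil), and the three feature families below (content, longest segment, mean normalized position) are hypothetical examples, not the PBF-PSSE features of Dai et al. (2013) or the descriptors used in this paper.

```python
from itertools import groupby

STATES = "HEC"  # helix, strand, coil (assumed three-state alphabet)


def sse_features(sse):
    """Compute simple content and position features from an SSE string."""
    sse = sse.upper()
    n = len(sse)
    if n == 0 or any(s not in STATES for s in sse):
        raise ValueError("expected a non-empty string over H, E, C")

    # Consecutive runs of the same state, e.g. "HHEE" -> [("H", 2), ("E", 2)].
    runs = [(state, sum(1 for _ in group)) for state, group in groupby(sse)]

    features = {}
    for state in STATES:
        positions = [i for i, s in enumerate(sse, start=1) if s == state]
        run_lengths = [length for st, length in runs if st == state]
        # Content: fraction of residues in this state (ignores position).
        features["content_" + state] = len(positions) / n
        # Longest segment of this state, normalized by sequence length.
        features["longest_" + state] = max(run_lengths, default=0) / n
        # Mean position of this state, normalized to (0, 1]; 0.0 if absent.
        features["mean_pos_" + state] = (
            sum(positions) / (len(positions) * n) if positions else 0.0
        )
    return features
```

For example, `"HHHHEECC"` and `"EEHHHHCC"` have the same helix content but different values of `mean_pos_H`, so only the position-aware feature distinguishes them.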

In addition to approaches based on AAC and PSSM, a large body of research has focused on the physicochemical and biochemical properties of individual amino acids, and representations of proteins have been proposed for predicting protein structural classes that are based on these properties (Bu et al., 1999). Using the physicochemical properties, a protein can be represented by a set of 20 numerical values taken from the amino acid index (AAindex) (Kawashima and Kanehisa, 2000).
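To make the idea of a property-based encoding concrete, the sketch below maps each residue of a sequence to one numerical value from a single amino acid index. The Kyte-Doolittle hydropathy scale is used purely as an example of such an index; the source does not say which AAindex entries the authors use, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids,
# chosen here only as an example of one amino acid index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_with_index(sequence, index=KYTE_DOOLITTLE):
    """Replace each standard residue by its value in the given index."""
    return [index[aa] for aa in sequence.upper() if aa in index]


def mean_property(sequence, index=KYTE_DOOLITTLE):
    """Average index value over the sequence (one simple global feature)."""
    values = encode_with_index(sequence, index)
    if not values:
        raise ValueError("sequence contains no standard amino acids")
    return sum(values) / len(values)
```

The numeric profile returned by `encode_with_index` has the same length as the sequence, so a fixed-length descriptor still has to be derived from it, for instance a mean as in `mean_property` or correlation terms between positions.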

Our goal in this work is to develop a system that performs better than previous predictors for protein structure classification. We accomplish this goal by building an ensemble of SVMs, where each SVM is trained using a different protein descriptor based on the following protein representations¹ (all of which are described in detail in Section 2):

• Position specific scoring matrix (PSSM) of proteins;
• Substitution matrix (SM) representation²;
• Secondary structure elements (SSE) sequence.
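The fusion step of such an ensemble, in which one classifier is trained per descriptor and the outputs are combined by the sum rule, can be sketched as follows. This is a minimal illustration, assuming that each classifier already returns one score per structural class and that the scores are on comparable scales; the class labels, function name, and example scores are hypothetical and do not come from the authors' MATLAB code.

```python
# Labels for the four structural classes (ASCII names used for convenience).
CLASSES = ("all-alpha", "all-beta", "alpha+beta", "alpha/beta")


def sum_rule(score_lists):
    """Fuse per-classifier class scores by summing them class by class.

    score_lists: one list of class scores per classifier, all of equal length.
    Returns (index of the winning class, list of summed scores).
    """
    if not score_lists:
        raise ValueError("at least one classifier output is required")
    n_classes = len(score_lists[0])
    if any(len(scores) != n_classes for scores in score_lists):
        raise ValueError("all classifiers must score the same classes")

    # zip(*score_lists) groups the scores that belong to the same class.
    fused = [sum(column) for column in zip(*score_lists)]
    best = max(range(n_classes), key=fused.__getitem__)
    return best, fused


# Hypothetical scores from three classifiers, one per descriptor.
scores = [
    [0.6, 0.2, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.6, 0.2, 0.1],
]
best, fused = sum_rule(scores)
predicted_class = CLASSES[best]
```

In this example the first classifier alone would choose the first class, but the summed scores favour the second class, which shows how the sum rule lets classifiers trained on other descriptors override a single dissenting one.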

In Section 2, we propose new descriptors based on SM and SSE. We test our ensemble of descriptors on three large datasets that are well-established in the literature. As reported in Section 3, our system significantly outperforms previous state-of-the-art approaches.

Section snippets

Pattern representation and feature extraction

In this study we apply a machine learning approach to the protein structure classification problem. The goal is to find an effective yet compact representation of proteins that is based on a fixed-length encoding scheme that can be coupled with a general purpose classifier, an approach that has been applied successfully to other biological problems, e.g., subcellular localization and protein–protein interactions (Chou and Shen, 2007, Nanni et al., 2010).

In this paper we investigate different

Benchmark datasets

The present study is evaluated on three widely used low-similarity benchmark datasets: FC699 (Kurgan et al., 2008), 1189 (Wang and Yuan, 2000), and 640 (Yang et al., 2010). Table 1 presents, for each dataset, the number of samples in each of the four structural classes: all-α, all-β, α+β, and α/β.

As suggested in Kong et al. (2014) the 25PDB dataset (Kurgan and Homaeian, 2006) is used as training set, while the other three datasets FC699, 1189 and 640 are used as test sets (independent

Conclusion

In this work we present a system for predicting protein structure classes using different protein representations and descriptors for training an ensemble of SVMs. The features that describe a given protein are obtained using representations based on PSSM, SM, and SSE. We present an empirical study where different feature extraction methods for representing proteins are compared and combined. Moreover, novel configurations are proposed and evaluated. The best performance is obtained when the

References (51)

K.-C. Chou. Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review). J. Theor. Biol. (2011)

K.-C. Chou et al. Predicting protein structural class by functional domain composition. Biochem. Biophys. Res. Commun. (2004)

S. Ding. A novel protein structural classes prediction method based on predicted secondary structure. Biochimie (2012)

P. Du et al. PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou’s pseudo-amino acid compositions. Anal. Biochem. (2012)

D.T. Jones. Protein secondary structure prediction based on position specific scoring matrices. J. Mol. Biol. (1999)

L. Kong et al. Accurate prediction of protein structural classes by incorporating predicted secondary structure information into the general form of Chou’s pseudo amino acid composition. J. Theor. Biol. (2014)

L.A. Kurgan et al. Prediction of structural classes for protein sequences and domains: impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy. Pattern Recognit. (2006)

T. Liu et al. A high-accuracy protein structural class prediction algorithm using predicted secondary structural information. J. Theor. Biol. (2010)

L. Nanni et al. A high performance set of PseAAC descriptors extracted from the amino acid sequence for protein classification. J. Theor. Biol. (2010)

H.B. Shen et al. PseAAC: a flexible web-server for generating various kinds of protein pseudo amino acid composition. Anal. Biochem. (2008)

A. Sharma. A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. J. Theor. Biol. (2013)

X. Xiao et al. iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal. Biochem. (2013)

S. Zhang et al. High-accuracy prediction of protein structural class for low-similarity sequences based on predicted secondary structure. Biochimie (2011)

C. Anfinsen. Principles that govern the folding of protein chains. Science (1973)

F. Birzele et al. A new representation for protein secondary structure prediction based on frequent patterns. Bioinformatics (2006)

W.S. Bu. Prediction of protein (domain) structural classes based on amino-acid index. Eur. J. Biochem. (1999)

D.S. Cao et al. Propy: a tool to generate various modes of Chou’s PseAAC. Bioinformatics (2013)

L. Chen et al. Predicting Anatomical Therapeutic Chemical (ATC) classification of drugs by integrating chemical–chemical interactions and similarities. PLoS One (2012)

W. Chen, P.M. Feng, et al. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition,... (2013)

K.-C. Chou. A novel approach to predicting protein structural classes in a (20-1)-d amino acid composition space. Proteins (1995)

K.-C. Chou. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Struct., Funct. Genet. (2001)

K.-C. Chou. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics (2005)

K.-C. Chou. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Curr. Proteom. (2009)

K.-C. Chou. Some remarks on predicting multi-label attributes in molecular biosystems. Mol. Biosyst. (2013)

K.-C. Chou et al. Review: recent progresses in protein subcellular location prediction. Anal. Biochem. (2007)

