Protein motifs retrieval by SS terns occurrences

https://doi.org/10.1016/j.patrec.2012.12.003

Abstract

This paper describes a new approach to the analysis of protein 3D structure based on the Secondary Structure (SS) representation. The focus here is on structural motif retrieval. The strategy is derived from the Generalized Hough Transform (GHT), but takes the triplet of SSs as the structural primitive element. The triplet identity is evaluated on the triangle whose vertices are the SS midpoints, and is represented by the three distances between those midpoints. The motif is characterized by the complete set of its triplets, so the Reference Table (RT) has one tuple for each triplet. Besides the discriminant component (the three edge lengths), each tuple contains the mapping rule, i.e. the location of the Reference Point (RP) relative to the triplet. In the macromolecule to be analyzed, each possible triplet is searched for in the RT, and every match contributes to a candidate location of the RP. The presence and location of the searched motif are confirmed when a candidate location collects a number of contributions equal (in the absence of noise and ambiguities) to the RT cardinality, i.e. the number of motif triplets. The approach is tested on twenty proteins selected randomly from the PDB, with numbers of SSs ranging from 14 to 46. The retrieval of all possible structural blocks composed of three, four and five SSs (very compact and completely distributed) has been conducted. The results show good performance in both precision and computation time.
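As a worked count of the RT cardinality mentioned above, assume the RT holds one tuple per unordered triplet of SSs (an assumption: the abstract does not state whether triplets are ordered). Under that assumption, a motif of $n$ SSs and a protein of $N$ SSs give:

```latex
|RT| = \binom{n}{3} = \frac{n(n-1)(n-2)}{6},
\qquad
\binom{3}{3} = 1,\quad \binom{4}{3} = 4,\quad \binom{5}{3} = 10,
\qquad
\binom{N}{3}\Big|_{N=14} = 364,\quad \binom{N}{3}\Big|_{N=46} = 15180 .
```

So a five-SS motif would be confirmed by 10 coincident contributions, while the largest protein in the stated range would offer 15180 candidate triplets to look up.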

Highlights

► Understanding protein structure is central to predicting protein function and evolution.
► We describe a new approach to analyzing proteins in 3D by Secondary Structures (SSs).
► We use the Generalized Hough Transform, considering SS triplets as structural primitive elements.
► The goal is the retrieval of structural blocks (motifs) composed of three to five SSs.
► Over 7.5 million cases show good performance in both precision and computation time.

Introduction

Many evolutionarily and functionally meaningful links between proteins come to light through the analysis of their spatial 3D structures. Protein structure and morphology are significant for understanding and predicting protein functionality (Shuoyong et al., 2007). For this reason, protein structure comparison and retrieval are basic issues that help biologists to understand various aspects of protein function and evolution, including phylogenetic evaluation and the tasks that proteins perform, i.e. their role in the machinery of life.

The protein 3D structure is vitally important in many biological applications, such as rational drug design. A protein 3D structure can be obtained by different experimental and bioinformatics methods. X-ray crystallography is a powerful tool for this purpose, although it is time-consuming, expensive, and not feasible for all proteins (e.g. so far very few membrane protein structures have been determined). Nuclear magnetic resonance (NMR) is another tool that can be employed to determine the 3D structures of membrane proteins, although it is also time-consuming and costly. In order to acquire structural information in a timely manner, it is possible to adopt various bioinformatics tools (see, e.g., Li et al., 2011; Ma et al., 2012; Wang and Chou, 2011; Chou et al., 1997; Wang and Chou, 2012; and the review by Chou, 2005). The present study is devoted to developing a novel method to search a database of protein structures for 3D patterns of secondary structural elements.

Structural comparison and protein structure retrieval problems have been studied in the structural biology community, in most cases by representing the protein as a set of SS elements. Can and Wang (2003) present a method for conducting protein structure similarity searches that applies differential geometry to the 3D structure in order to extract “signatures” such as curvature, torsion and SS type. Camoglu et al. (2003), to find similarities in protein databases, build an indexing structure based on triplets of SS elements by using an R-tree. Chionh et al. (2003) propose the SCALE algorithm to compare protein 3D structures through matrices that utilize angles and distances between SS elements. Krissinel and Henrick (2004) describe the Secondary Structure Matching (SSM) algorithm for comparison in 3D, including an original procedure for matching graphs built on the protein’s SS elements, followed by an iterative 3D alignment of the protein backbone Cα atoms. Chi et al. (2004) design a fast system for protein structural block retrieval by using image based distance matrices and multidimensional indices.

The 1D string representation of local protein structure retains a degree of structural information, and this type of representation can be a powerful tool for comparison and classification. Friedberg et al. (2006) describe the use of a particular structure fragment library, denoted as KL-strings, for the 1D representation of protein structure, and develop an infrastructure for comparing structures with 1D representation. Shuoyong et al. (2007) develop a program, ProSMoS (Protein Structure Motif Search), to find fold-level structural similarities and to search for the presence of structural motifs. This package searches a library of protein structures for user-defined 3D patterns of SS elements. A web server for pattern-based search, using the interaction matrix representation of protein structures, has also been developed (Shuoyong et al., 2009). Albrecht et al. (2008) propose a different approach: they apply data reduction techniques directly to the protein structure and convert 3D data into 2D, thereby accelerating the structural comparisons. Zotenko et al. (2007) propose an approach to speed up protein comparison by mapping a protein structure to a high-dimensional vector and approximating structural similarity by suitable distances between the corresponding vectors. Zhang et al. (2009), using a transition probability matrix and some structural characteristic vectors of proteins, develop FDOD (Function of Degree of Disagreement), a scoring scheme to measure protein similarity. Nguyen and Madhusudhan (2011) propose a new algorithm, CLICK, to capture such similarities. This method optimally superimposes a pair of protein structures independently of their topology and can generally be applied to compare any pair of molecular structures represented in Cartesian coordinates, as exemplified by the RNA structure superimposition benchmark. Cantoni and Mattia (2012) and Cantoni et al. (2012) study the retrieval of structural motifs by using the GHT and a range tree. This approach is completely new, because the analysis is based on the 3D spatial distribution of the SSs.

In this paper, a new approach for structural block retrieval based on protein SS comparison is proposed. Here, triangles joining the midpoints of the SS triplets are considered as “structural elements”, and all the block triangles are compared with all the macromolecule triangles. The focus of the paper is on the retrieval of an existing structural block that is completely and precisely known. The block can be defined without constraints such as adjacency, distance limits, homogeneity, etc. The only constraint is that the SS components exist in the protein macromolecule.
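The triangle comparison and RP voting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names, the quantization step `tol`, the ordering of vertices by opposite edge length, and the local frame used to store the RP are hypothetical choices, since the excerpt does not specify the mapping rule. The coordinates are made-up SS midpoints, and the sketch assumes no three midpoints are collinear.

```python
from collections import Counter
from itertools import combinations
import math

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])
def unit(a):
    n = math.sqrt(dot(a, a))  # zero for degenerate input (assumed not to occur)
    return tuple(x / n for x in a)

def canonical(tri):
    """Order the vertices by the length of the opposite edge (assumed convention)."""
    opp = [math.dist(tri[(i + 1) % 3], tri[(i + 2) % 3]) for i in range(3)]
    order = sorted(range(3), key=lambda i: opp[i])
    return [tri[i] for i in order], [opp[i] for i in order]

def frame(pts):
    """Right-handed orthonormal frame attached to an ordered triangle."""
    a, b, c = pts
    e1 = unit(sub(b, a))
    e3 = unit(cross(sub(b, a), sub(c, a)))
    e2 = cross(e3, e1)
    return a, (e1, e2, e3)

def key(lengths, tol):
    """Discriminant component: the three edge lengths, quantized by tol."""
    return tuple(round(l / tol) for l in lengths)

def build_rt(motif, rp, tol=0.5):
    """Reference Table: one tuple per motif triplet, storing the RP in the local frame."""
    rt = {}
    for tri in combinations(motif, 3):
        pts, lengths = canonical(tri)
        origin, axes = frame(pts)
        local = tuple(dot(sub(rp, origin), e) for e in axes)
        rt.setdefault(key(lengths, tol), []).append(local)
    return rt

def vote(protein, rt, tol=0.5):
    """Look up every protein triplet in the RT and vote for candidate RP cells."""
    votes = Counter()
    for tri in combinations(protein, 3):
        pts, lengths = canonical(tri)
        matches = rt.get(key(lengths, tol), [])
        if not matches:
            continue
        origin, axes = frame(pts)
        for local in matches:
            cand = tuple(origin[k] + sum(local[j] * axes[j][k] for j in range(3))
                         for k in range(3))
            votes[tuple(round(x / tol) for x in cand)] += 1
    return votes

# Made-up SS midpoints; the motif is a subset of the protein's SSs.
protein = [(0.0, 0.0, 0.0), (7.0, 1.0, 0.0), (2.0, 9.0, 1.0),
           (3.0, 4.0, 11.0), (15.0, 6.0, 4.0), (9.0, 14.0, 8.0)]
motif = [protein[i] for i in (0, 2, 3, 5)]
rp = (4.0, 6.0, 5.0)            # arbitrary reference point chosen for the demo
rt = build_rt(motif, rp)
votes = vote(protein, rt)
cell = tuple(round(x / 0.5) for x in rp)
```

In this sketch the accumulator cell containing the RP should collect at least one vote per motif triplet, which is the acceptance criterion stated in the abstract. Isosceles or equilateral triangles make the vertex ordering ambiguous, and a real implementation would need to handle those cases as well as the quantization boundaries.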

The rest of the paper is organized as follows. Section II introduces the GHT and the triangle approaches. Section III presents the experiments and their results. The final Section IV gives a brief discussion and describes future work.

Section snippets

Methodology

In this paper a novel GHT-based approach for motif retrieval is proposed. The GHT is used for the comparison and search of structural similarity between a given structural block (a motif, a domain or the entire protein) and the proteins of a database such as the PDB. Note that, if the searched structure is just a component of a protein (like a structural motif or a domain), the same algorithm supports the detection and the statistical distribution of these components. The primitive patterns to which

Experiments and performances

The aim of this experiment is to test precision and computation time of the proposed method.

In order to assess the statistical performances the following three cross-validation methods are often used: independent dataset test, subsampling (or K-fold cross validation) test, and jackknife test (Chou and Zhang, 1995). In particular, the jackknife test is considered less arbitrary in that it always produces a unique result for a given dataset. The rationale is: (i) for the independent dataset test,

Conclusions

Comparing protein structures and retrieving motifs remains an active area of development in structural biology. The new approach refers to the structural analysis of the 3D distribution of SSs. In this paper the problem of combining SS triplets for searching general motifs in protein structure datasets is considered (details are given for the cases of three, four and five SSs). The comparison is conducted by considering triangles as primitives (or as basic structural elements) using motif and

References (32)

  • C. Chen et al.

    Dual-layer wavelet svm for predicting protein structural class via the general form of Chou’s pseudo amino acid composition

    Protein Pept. Lett.

    (2012)
  • P.H. Chi et al.

    A fast protein structure retrieval system using image based distance matrices and multidimensional index

    Internat. J. Software Eng. Knowl. Eng.

    (2004)
  • Chionh, C.H., Haung, Z., Tan, K.L., Yao, Z., 2003. Augmenting SSEs with Structural Properties for Rapid Protein...
  • K.C. Chou

    Prediction of protein cellular attributes using pseudo amino acid composition

    Proteins: Struct. Funct. Genetics

    (2001)
  • K.C. Chou

    Coupling interaction between thromboxane A2 receptor and alpha-13 subunit of guanine nucleotide-binding protein

    J. Proteome Res.

    (2005)
  • K.C. Chou et al.

    Review: Recent advances in developing web-servers for predicting protein attributes

    Nat. Sci.

    (2009)
Cited by (9)

    • Pattern recognition and beyond: Alfredo Petrosino's scientific results

      2020, Pattern Recognition Letters
      Citation Excerpt :

      In [13] and [15], a novel 3D structural representations of the proteins is proposed and adopted to exploit the learning capabilities of unsupervised (SOM) and supervised (G-NN) techniques, respectively. For searching protein structural similarities in databases, in [14], protein motifs retrieval is tackled based on the use of the generalized 3D Hough transform. Recent research finds its application in the design of biometric systems.

    • World Competitive Contests (WCC) algorithm: A novel intelligent optimization algorithm for biological and non-biological problems

      2016, Informatics in Medicine Unlocked
      Citation Excerpt :

      Some of the other related works that we can refer to are as follows: Specifications of the monocyte activating motif in the mycobacterium (mycobacterial) tuberculosis [17], motifs retrieval by the Secondary Structure terns occurrences [18], and the interaction of binding motif within the nucleocapsid protein of porcine reproductive and respiratory syndrome virus and the host cellular signaling proteins [19]. Proposed optimization algorithm starts with the first population of teams [20].

    • Geometrical motifs search in proteins: A parallel approach

      2015, Parallel Computing
      Citation Excerpt :

      The computational complexity of the GHT can be quite relevant, since it depends on the number elements that make up the model, (that is, the cardinality of the reference table), on the number of feature elements present in the feature space to be analyzed, and on the resolution at which the voting space is quantized. The Secondary Structures Co-occurrences (SSC) [2,3] and the Secondary Structures Triplets (SST) [14] are two algorithms, based on the GHT, which search for geometrical motifs (patterns) of SSEs (feature elements) inside a given protein (search space). The SSC algorithm uses pairs (co-occurrences) of SSEs, while SST uses terns.

    • A method of protein model classification and retrieval using bag-of-visual-features

      2014, Computational and Mathematical Methods in Medicine
    • Motifs and structural blocks retrieval by GHT

      2014, European Physical Journal Plus
    • CCMS: A greedy approach to motif extraction

      2013, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)