Journal of Molecular Biology
Volume 342, Issue 1, 3 September 2004, Pages 307-320

Predicting Metal-binding Site Residues in Low-resolution Structural Models

https://doi.org/10.1016/j.jmb.2004.07.019

The accurate prediction of the biochemical function of a protein is becoming increasingly important, given the unprecedented growth of both structural and sequence databanks. Consequently, computational methods are required to analyse such data in an automated manner to ensure genomes are annotated accurately. Protein structure prediction methods, for example, are capable of generating approximate structural models on a genome-wide scale. However, the detection of functionally important regions in such crude models, as well as structural genomics targets, remains an extremely important problem. The method described in the current study, MetSite, represents a fully automatic approach for the detection of metal-binding residue clusters applicable to protein models of moderate quality. The method involves using sequence profile information in combination with approximate structural data. Several neural network classifiers are shown to be able to distinguish metal sites from non-sites with a mean accuracy of 94.5%.

The method was demonstrated to identify metal-binding sites correctly in LiveBench targets where no obvious metal-binding sequence motifs were detectable using InterPro. Accurate detection of metal sites was shown to be feasible for low-resolution predicted structures generated using mGenTHREADER where no side-chain information was available. High-scoring predictions were observed for a recently solved hypothetical protein from Haemophilus influenzae, indicating a putative metal-binding site.

Introduction

The accurate prediction of biological function on a genome-wide scale promises wide-ranging benefits in understanding complex biological processes. Such understanding will be a key stepping-stone in the development of techniques and pharmaceuticals to target genes associated with disease and their products. The rapid growth of the Protein Data Bank (PDB)1 highlights the challenges ahead. Gene products from different species may exhibit similar biological function, but show little or no sequence similarity due to convergent evolution. Structural classification of protein domains such as CATH,2 SCOP3 and FSSP4 reveals that members of the same structural family can span different functional classes. Furthermore, key active sites may be conserved despite there being little overall structural and sequence similarity. It is therefore clear that analysis of functional regions will allow the development of more reliable genome annotation and enhance our knowledge of the biological role of proteins at a cellular level.

Historically, structural insights have provided the most detailed information on biological function, highlighting, for example, a specific catalytic mechanism or interactions with other molecules. An in-depth analysis of key functional regions, such as those in enzyme active sites, metal-binding sites and ligand-binding clefts, as well as interacting regions between proteins, is likely to add significantly to the repertoire of tools currently available.

Several groups have developed atomic-level methods for analysing site regions.5, 6, 7, 8 The TESS method6 generates templates for locating geometric patterns in atoms occupying the site regions. This requires the accurate placement of specific side-chain atoms for sensitive site recognition and may not be suitable for model proteins. Fetrow and Skolnick have developed the “Fuzzy Functional Form” representation of a functional region,7 combining information from the literature with sequence and structural analysis. This method can be applied to lower-resolution structures; however, the site under investigation must have been fully characterized and information must be retrieved manually from several sources. Bagley and Altman have developed the FEATURE9 method to characterize well-defined functional sites on the basis of statistical descriptors derived from a set of site and non-site data. FEATURE uses the exact placement of side-chain atoms and incorporates secondary structural information, but does not directly include conservation information for site residues. The method was applied to locate calcium-binding sites in a set of model structures10 and was shown to require high-resolution placement of atoms specific to the site region.

Sequence searching tools have now become routine in initial investigations of new protein and DNA sequences. Pairwise comparison methods such as BLAST are generally effective only where sequence identity is at least 30%. PSI-BLAST11 improves sensitivity by using sequence profiles and an iterative search strategy. Sequence-signature based methods such as PROSITE,12 PRINTS,13 Pfam,14 and BLOCKS15 search for sequence patterns within a query sequence. Generally, the sequence motif is specific to a functional family and can be used to infer functional information. These resources have been combined in the InterPro database, providing access to over 3000 entries.16 However, each of these methods has limitations. PROSITE regular expressions are effective for short motifs but the method fails in identifying members of highly divergent super-families. In contrast, the PRINTS fingerprints are derived from multiple sequence alignments and are particularly suitable for sub-family distinctions but fail at short motif recognition. The hidden Markov models used by Pfam provide a sensitive tool for identifying highly divergent members of super-families but may be less appropriate for sub-family predictions.
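To make the idea of a sequence-signature search concrete, the sketch below translates a pattern written in PROSITE-style syntax into a Python regular expression and scans a sequence with it. It is only an illustration: the function name, the toy sequence and the example pattern (written in the style of a C2H2 zinc finger signature) are assumptions, and only a subset of the syntax is handled (residues, `x` wildcards, `[...]` alternatives, `{...}` exclusions and repeat counts).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles single residues, x (any residue), [ABC] (any of), {ABC} (none of)
    and repeat counts such as x(3) or x(2,4). Sequence anchors are not handled.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split an element such as "x(2,4)" into its core and optional counts.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        # "[ABC]" and single letters are already valid regex fragments.
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)

# Illustrative pattern only; treat it as an assumption, not a database entry.
ZINC_FINGER_LIKE = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
toy_sequence = "MK" + "CAACAAAF" + "A" * 8 + "HAAAH"  # hypothetical sequence

regex = prosite_to_regex(ZINC_FINGER_LIKE)
hit = re.search(regex, toy_sequence)
print(regex, hit.span() if hit else None)
```

A fixed pattern of this kind matches only sequences that preserve the exact spacing of the motif, which is one way to see why such expressions struggle with highly divergent super-family members.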

Residues that are not local in sequence but local in structure may form site regions. Sequence-based approaches cannot encode directly the 3D spatial organization of functional residues or the atoms responsible for biochemical action in the folded protein. Methods that combine sequence information with structural data offer a powerful approach for determining important functional locations. Rinaldis et al. have mapped sequence profile to surface structure to identify similarities in site regions of SH2 and SH3 domains as well as P-loop nucleotide-binding pockets.17, 18

Protein structures crystallized in the absence of small-molecule substrates or metal ions highlight a further need to identify such functional regions automatically. Sites may be occupied by molecules found in buffering solutions, such as SO42-, thereby preventing binding of other prosthetic groups. Structural models pose an even harder challenge, since prosthetic group binding in such cases will need to be predicted computationally. The correct placement of side-chain atoms is rare, even for very good structural models. Methods capable of locating and classifying sites in predicted structures are therefore likely to improve the quality of genomic fold recognition efforts. Detailed analysis has been presented on the specific atomic geometry of metal sites in proteins. Karlin et al. undertook a comprehensive survey of residue and atomic preferences of metal ion ligation.19 Metal ion binding was investigated by Gregory et al.,20 by developing hydrophobicity contrast measures, again using specific atomic placement on a limited dataset.

The rapid growth of the PDB and of metal-containing structures, alongside the explosion of sequence information, offers a new opportunity to characterize functional sites.

Here, we present a novel approach using artificial neural networks (ANN) to predict six commonly occurring metal ion sites: Ca2+, Cu2+, Fe3+, Mg2+, Mn2+ and Zn2+. The method is designed to identify residues forming the metal-binding site in super-families by combining sequence profile and structural information. The motivation of the study has been the development of functional site predictors where only moderate-quality structural information is available. Metal-binding site prediction was benchmarked for a set of newly released crystal structures from the LiveBench project.21 Site detection was shown to be effective in structural models of these targets. We report a putative metal-binding site predicted in a structural genomics target with unknown function.

Datasets

The training set was constructed by taking all protein chains interacting with the specified metal ions from the PDB and clustering at a 25% sequence identity; this resulted in 1018 sequence clusters. For the purposes of cross-validation, these chains were then grouped into 364 distinct SCOP super-families. The numbers of PDB chains, super-families and metal sites in the dataset are summarized in Table 1.
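Grouping chains by super-family before cross-validation keeps related chains from appearing on both sides of a train/test split. The sketch below shows one possible way to do this and is not the authors' procedure: the fold count, the chain identifiers, the super-family labels and the greedy balancing step are all assumptions made for illustration.

```python
from collections import defaultdict

def superfamily_folds(chain_to_superfamily, n_folds):
    """Assign chains to folds so that no super-family spans two folds."""
    groups = defaultdict(list)
    for chain, superfamily in chain_to_superfamily.items():
        groups[superfamily].append(chain)

    folds = [[] for _ in range(n_folds)]
    # Place the largest super-families first, each into the currently
    # smallest fold, to keep fold sizes roughly balanced.
    for superfamily in sorted(groups, key=lambda s: (-len(groups[s]), s)):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(sorted(groups[superfamily]))
    return folds

# Hypothetical chain identifiers and super-family labels.
chains = {
    "1abcA": "sf1", "2defA": "sf1",
    "3ghiB": "sf2",
    "4jklA": "sf3", "5mnoA": "sf3",
    "6pqrC": "sf4",
}
folds = superfamily_folds(chains, n_folds=2)
print(folds)
```

Each fold can then serve once as the held-out set while the remaining folds are used for training, so performance is always measured on super-families the classifier has not seen.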

Feature analysis

In order to determine the key features that allow effective discrimination of metal sites

Discussion

We have developed MetSite, a set of artificial neural network classifiers that have been optimized to locate metal-binding regions in protein structures and were shown to predict metal sites in modelled protein structures from LiveBench accurately. Position-specific scoring matrices for metal-binding residues, as well as residues forming the second coordination shell interactions, were shown to be sufficient for discriminating metal ion sites from non-sites. Secondary structure, solvent

Materials and Methods

The PSI-BLAST score matrices were derived by performing three iterations of PSI-BLAST against a non-redundant database for all the unique chains across all datasets. We defined metal site seed residues as those residues with main-chain atoms within 7 Å of a metal ion; the N closest neighbouring residues to these seeds were marked as seed neighbours (Figure 5).
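The seed and seed-neighbour definitions can be sketched in a few lines of code. This is a minimal illustration rather than the published implementation: the data layout (a dictionary of residue identifiers mapped to main-chain atom coordinates), the function names and the toy coordinates are assumptions, as is the choice to rank neighbours by their smallest main-chain distance to any seed residue and to take N over the whole site rather than per seed. Only the 7 Å cut-off comes from the text.

```python
import math

SEED_CUTOFF = 7.0  # Å, the cut-off quoted in the text

def find_seeds(residues, metal, cutoff=SEED_CUTOFF):
    """Residues with any main-chain atom within `cutoff` Å of the metal ion."""
    return [
        rid for rid, atoms in residues.items()
        if any(math.dist(atom, metal) <= cutoff for atom in atoms)
    ]

def find_seed_neighbours(residues, seeds, n):
    """The n non-seed residues closest to the seed set.

    Assumption: closeness is the smallest main-chain atom distance
    between a candidate residue and any seed residue.
    """
    if not seeds:
        return []
    seed_atoms = [atom for rid in seeds for atom in residues[rid]]
    seed_set = set(seeds)

    def gap(rid):
        return min(math.dist(a, s) for a in residues[rid] for s in seed_atoms)

    candidates = [rid for rid in residues if rid not in seed_set]
    return sorted(candidates, key=gap)[:n]

# Toy data: one hypothetical main-chain atom per residue, placed on a line.
residues = {
    "A10": [(1.0, 0.0, 0.0)],
    "A11": [(5.0, 0.0, 0.0)],
    "A12": [(9.0, 0.0, 0.0)],
    "A13": [(12.0, 0.0, 0.0)],
    "A50": [(30.0, 0.0, 0.0)],
}
metal_ion = (0.0, 0.0, 0.0)

seeds = find_seeds(residues, metal_ion)
neighbours = find_seed_neighbours(residues, seeds, n=2)
print(seeds, neighbours)
```

Because only main-chain coordinates are used, a procedure of this kind does not depend on side-chain placement, which is consistent with the stated aim of working on moderate-quality models.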

For each marked residue, several features were calculated. These included the 20 scores taken from the PSI-BLAST PSSM, secondary structure

Acknowledgements

We thank Tim Ebbels, Sundeep Singh Deol and Gurpreet Singh Nagra for helpful discussions. This work was sponsored by the Medical Research Council (to J.S.S. and J.J.W.).

References (33)

  • L. Holm et al. Mapping the protein universe. Science (1996)
  • A.C. Wallace et al. TESS: a geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci. (1996)
  • S.C. Bagley et al. Characterizing the microenvironment surrounding protein sites. Protein Sci. (1995)
  • S.F. Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. (1997)
  • K. Hofmann et al. The Prosite Database, its status in 1999. Nucl. Acids Res. (1999)
  • T.K. Attwood et al. PRINTS-S: the database formerly known as PRINTS. Nucl. Acids Res. (2000)