Journal of Molecular Biology
Volume 425, Issue 8, 26 April 2013, Pages 1274-1286

Proteins and Domains Vary in Their Tolerance of Non-Synonymous Single Nucleotide Polymorphisms (nsSNPs)

https://doi.org/10.1016/j.jmb.2013.01.026

Abstract

The widespread application of whole-genome sequencing is identifying numerous non-synonymous single nucleotide polymorphisms (nsSNPs), many of which are associated with disease. We analyzed nsSNPs from Humsavar and the 1000 Genomes Project to investigate why some proteins and domains are more tolerant of mutations than others. We identified 311 proteins and 112 Pfam families, corresponding to 2910 domains, as disease susceptible and 32 proteins and 67 Pfam families (10,783 domains) as disease resistant based on the relative numbers of disease-associated and neutral polymorphisms. Proteins with no significant difference from expected numbers of disease and polymorphism nsSNPs are classified as other. This classification takes into account the phenotypes of all known mutations in the protein or domain rather than simply classifying based on the presence or absence of disease nsSNPs. Of the two hypotheses suggested, our results support the model that disease-resistant domains and proteins are more able to tolerate mutations rather than having more lethal mutations that are not observed. Disease-resistant proteins and domains show significantly higher mutation rates and lower sequence conservation than disease-susceptible proteins and domains. Disease-susceptible proteins are more likely to be encoded by essential genes, are more central in protein–protein interaction networks and are less likely to contain loss-of-function mutations in healthy individuals. We use this classification for nsSNP phenotype prediction, predicting nsSNPs in disease-susceptible domains to be disease and those in disease-resistant domains to be polymorphism. In this way, we achieve higher accuracy than SIFT, a state-of-the-art algorithm.

Highlights

► Many disease-associated nsSNPs and the proteins containing them have been studied.
► We use the numbers of disease and neutral nsSNPs to classify proteins and domains.
► Tolerance of proteins and domains for mutations is related to conservation.
► It is also related to interaction networks and function.
► This classification is useful for predicting phenotypes of nsSNPs.

Introduction

As the cost and time required for DNA sequencing fall drastically,[1] many mutations are being discovered that could be used to help determine genetic influences on disease and thereby personalize medicine.[2], [3] A recent study[4] found over 13,000 exonic single nucleotide polymorphisms (SNPs) per person, with around 58% of these leading to a change in the protein sequence. These non-synonymous single nucleotide polymorphisms (nsSNPs) have been the focus of considerable attention for disease studies because the change in sequence gives a number of possible functional impacts. However, not all mutations are functionally important and different proteins and domains differ in how well they tolerate mutations. Several studies[5], [6], [7], [8] have investigated the properties of disease-causing proteins, but far fewer[9], [10], [11] consider the location of disease-causing mutations from a domain perspective. Here we map disease-associated nsSNPs to proteins and domains and investigate the properties of those proteins and domains with significantly high or low levels of disease-associated nsSNPs.

Not all nsSNPs have functional consequences, but as it is not feasible to experimentally determine the effects of all mutations, a number of computational methods, such as SIFT,[12] PolyPhen[13] and SNAP,[14] have been developed to predict the phenotypic impact of nsSNPs based on structure, sequence and/or known functional residues.

Many disease-associated nsSNPs lead to a decrease in protein stability. This is exploited by PoPMuSiC[15] to predict the effects of nsSNPs based on estimated changes to protein stability. nsSNPs can also impact upon a protein's interactions with other proteins. Two recent studies[16], [17] found that disease-causing nsSNPs are more commonly found at protein–protein interaction (PPI) interfaces than polymorphic nsSNPs, suggesting that, in some cases, disease is a result of impaired interaction.

Studies have examined which features of proteins make them more or less likely to be related to disease in the context of PPI networks.[18], [19], [20] A number of graph theoretic calculations can be used to find various centralities in order to determine which proteins are most important in the network. These include degree, betweenness and coreness centralities (defined in Materials and Methods). Proteins with high degree are termed “hubs” and include scaffold proteins and other proteins involved in multiple processes.
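
The three centralities named above can be illustrated with a minimal sketch. The definitions given in Materials and Methods are not part of this excerpt, so the code assumes the standard graph-theoretic ones: degree as the number of interaction partners, betweenness as the number of shortest paths between other node pairs that pass through a node (computed here with Brandes' algorithm on an unweighted, undirected graph), and coreness as the largest k for which the node belongs to the k-core. The toy network and its node names are hypothetical.

```python
from collections import deque

def degree(adj):
    """Degree centrality: number of direct interaction partners."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def betweenness(adj):
    """Unnormalized betweenness (Brandes' algorithm, unweighted, undirected)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order = []                      # nodes in order of non-decreasing distance
        preds = {v: [] for v in adj}    # shortest-path predecessors
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while order:                    # back-propagate dependencies
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}

def coreness(adj):
    """Core number: repeatedly peel off the node of minimum remaining degree."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        v = min(remaining, key=deg.get)
        k = max(k, deg[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                deg[u] -= 1
    return core

# Hypothetical toy PPI network: a triangle (H, A, B) with a tail H-C-D.
toy_ppi = {
    "H": {"A", "B", "C"},
    "A": {"H", "B"},
    "B": {"H", "A"},
    "C": {"H", "D"},
    "D": {"C"},
}
deg_c = degree(toy_ppi)
btw_c = betweenness(toy_ppi)
core_c = coreness(toy_ppi)
```

In this toy network H has the highest degree (a "hub") and the highest betweenness, while C has a lower degree than H but still lies on every shortest path to D, which is the sense in which a node can act as a "bottleneck" without being a hub.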

It has been suggested that disease proteins are more likely to act as “bottlenecks” in the PPI network, measured as high betweenness, but are no more likely to be hubs than other proteins.[21] Proteins associated with multiple diseases are more central than other proteins, particularly if the diseases are phenotypically different, implying that the different phenotypes may be caused by effects on different biological processes.[19] Goh et al. produced a “human disease network”, with diseases linked to their causative proteins in the OMIM (Online Mendelian Inheritance in Man) Morbid Map.[22] They found that essential disease proteins are located centrally in the network, while nonessential disease proteins are found more peripherally.

While most studies have focused on proteins, it is also important to examine in which domains the disease nsSNPs are found. Protein domains are generally easy to identify and often have well-defined functions, and thus, a finer-grained analysis is possible if one examines nsSNPs at a domain level rather than at the protein level. Pfam[23] is a database of Hidden Markov Models, each corresponding to a protein family. Protein sequences can be compared to these models to find areas of significant sequence homology and thereby identify domains. Wang et al. showed that nsSNPs located at different interaction interfaces or in different domains are more likely to cause different diseases, implying that separating proteins into domains is likely to be useful to further understand the molecular basis for disease.[16]

nsSNPs introduce a change in protein sequence, which may affect the identification of a domain by Pfam or other sequence-based methods. Liu and Tozeren[11] and Clifford et al.[10] found that disease-causing nsSNPs lead to greater disruption of domains, measured by changes in domain annotations using ScanProsite and Pfam, respectively. However, in both of these studies, the focus is on the properties of the proteins affected by domain changes rather than on the domains themselves. Zhang et al. used known domain-disease associations and a domain–domain interaction network to produce a Bayesian predictor to associate domains with diseases and show that domains interacting with one another are likely to cause similar diseases.[24] They also used their method to help in disease gene prediction by restricting their search to genes encoding proteins containing the domains associated with the disease of interest.

Generally these studies separate proteins or domains into two groups based on the presence or absence of disease mutations. This separation is not entirely satisfactory, as it only requires a single mutation in a previously “Non-Disease” protein to cause it to switch to the “Disease” group. Accordingly, here we present a classification of “disease propensity” to identify those proteins or domains that contain significantly more or fewer disease nsSNPs than expected, and we refer to these as disease susceptible and disease resistant, respectively. We have examined what makes a domain or protein more likely to contain disease-causing mutations, finding that some proteins and domains are more able to accommodate nsSNPs than others. By using this classification for nsSNP phenotype prediction, with nsSNPs in disease-susceptible domains predicted to be disease and those in disease-resistant domains predicted to be polymorphism, we achieve higher accuracy than SIFT, a state-of-the-art algorithm.
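
The prediction rule described above amounts to a lookup on the class of the domain that contains the nsSNP. The following is a minimal sketch of that rule only. How nsSNPs in domains classified as other were handled, and how accuracy was scored against SIFT, are not stated in this excerpt, so returning no prediction for "other" and scoring only the predicted cases are assumptions. The function names and the toy data are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Rule from the text: the class of the containing domain decides the call.
RULE = {
    "disease susceptible": "disease",
    "disease resistant": "polymorphism",
}

def predict_phenotype(domain_class: str) -> Optional[str]:
    """Return the predicted phenotype, or None when the rule does not apply.

    Assumption: domains classified as "other" yield no prediction.
    """
    return RULE.get(domain_class)

def accuracy(cases: Iterable[Tuple[str, str]]) -> Optional[float]:
    """Fraction of correct calls among nsSNPs that received a prediction.

    Each case is (class of containing domain, known phenotype label).
    """
    scored = [(predict_phenotype(c), truth) for c, truth in cases]
    scored = [(pred, truth) for pred, truth in scored if pred is not None]
    if not scored:
        return None
    return sum(pred == truth for pred, truth in scored) / len(scored)

# Hypothetical toy data, for illustration only.
toy_cases = [
    ("disease susceptible", "disease"),
    ("disease susceptible", "polymorphism"),
    ("disease resistant", "polymorphism"),
    ("other", "disease"),
]
toy_accuracy = accuracy(toy_cases)
```

Because the rule only covers nsSNPs that fall in susceptible or resistant domains, any comparison with a general-purpose predictor depends on which variants are included in the evaluation set.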

Identification of disease-resistant and disease-susceptible domains and proteins

To classify proteins and domains, we mapped disease-causing (disease) and neutral (polymorphism) nsSNPs to protein sequences from UniProt (Fig. 1a and b). Pfam[23] was then used to identify domains in these sequences and the nsSNPs were mapped to domains (Fig. 1c). If a domain or protein contains N nsSNPs, Nd of which are labeled as disease and Np as polymorphism with Nd > Np, the probability of it containing Nd or more disease nsSNPs can be calculated using a binomial distribution. Equally, if Np > 
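
The binomial tail probability described above can be sketched as follows. The excerpt does not give the expected disease proportion, the significance threshold or any multiple-testing correction, so `p_disease` and `alpha` are illustrative parameters, not the authors' values. The branch for Np > Nd is assumed by symmetry with the stated case, since the source sentence is cut off at that point.

```python
from math import comb

def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def classify(n_disease: int, n_poly: int, p_disease: float, alpha: float = 0.05) -> str:
    """Classify a protein or domain from its disease and polymorphism counts.

    p_disease: assumed probability that a random nsSNP is labeled disease
               (not specified in this excerpt).
    alpha:     assumed significance threshold (not specified in this excerpt).
    """
    n = n_disease + n_poly
    if n_disease > n_poly:
        # Probability of seeing Nd or more disease nsSNPs out of N.
        if binomial_tail(n, n_disease, p_disease) < alpha:
            return "disease susceptible"
    elif n_poly > n_disease:
        # Assumed symmetric case: Np or more polymorphism nsSNPs out of N.
        if binomial_tail(n, n_poly, 1 - p_disease) < alpha:
            return "disease resistant"
    return "other"

# Illustrative counts with an assumed p_disease of 0.5.
example_susceptible = classify(9, 1, p_disease=0.5)
example_resistant = classify(1, 9, p_disease=0.5)
example_other = classify(3, 2, p_disease=0.5)
```

With these assumed values, nine disease nsSNPs out of ten gives a tail probability of 11/1024 (about 0.011), below the threshold, whereas three out of five gives 0.5 and the domain is left as other. This illustrates why a single additional mutation does not flip the class the way a presence/absence split would.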

Conclusions

We classified proteins and domains as disease susceptible or disease resistant depending on the relative numbers of disease and polymorphism nsSNPs they contain. This classification is more robust than previously used methods of classifying proteins as “Disease” or “Non-Disease” and provides insight into features that affect whether a protein or domain is likely to contain disease mutations.

As well as being a useful classification for proteins and domains, we show that disease propensity could

Materials and Methods

nsSNPs from Humsavar† (downloaded September 21, 2011) were mapped to protein sequences from UniProt. Variants were also downloaded from the 1000 Genomes Project[25] FTP (File Transfer Protocol) site (ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz) and mapped to the UCSC hg19 Known Genes using ANNOVAR.[44] Any 1000 Genomes Project nsSNPs labeled as disease in Humsavar were classed as disease. Any other nsSNPs with a MAF of greater than 0.1 were assumed

Acknowledgements

We would like to thank an anonymous reviewer for their suggestion on the 1000 Genomes Project as an unbiased source of nsSNPs and Jeffrey Tang for helpful discussions about intrinsically disordered proteins. C.M.Y. is supported by a Medical Research Council Ph.D. studentship.

References (50)

  • T.-P. Nguyen et al. A quantitative approach to study indirect effects among disease proteins in the human protein interaction network. BMC Syst. Biol. (2010)
  • T. Ideker et al. Protein networks in disease. Genome Res. (2008)
  • A. Han et al. SNP@Domain: a web resource of single nucleotide polymorphisms (SNPs) within protein domain structures and sequences. Nucleic Acids Res. (2006)
  • R.J. Clifford et al. Large-scale analysis of non-synonymous coding region single nucleotide polymorphisms. Bioinformatics (2004)
  • Y. Liu et al. Domain altering SNPs in the human proteome and their impact on signaling pathways. PLoS One (2010)
  • P.C. Ng. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. (2003)
  • V. Ramensky et al. Human non-synonymous SNPs: server and survey. Nucleic Acids Res. (2002)
  • Y. Bromberg et al. SNAP predicts effect of mutations on protein function. Bioinformatics (2008)
  • Y. Dehouck et al. PoPMuSiC 2.1: a web server for the estimation of protein stability changes upon mutation and sequence optimality. BMC Bioinformatics (2011)
  • X. Wang et al. Three-dimensional reconstruction of protein networks provides insight into human genetic disease. Nat. Biotechnol. (2012)
  • A. David et al. Protein–protein interaction sites are hot spots for disease-associated nonsynonymous SNPs. Hum. Mutat. (2012)
  • A.-L. Barabási et al. Network medicine: a network-based approach to human disease. Nat. Rev. Genet. (2011)
  • S. Chavali et al. Network properties of human disease genes with pleiotropic effects. BMC Syst. Biol. (2010)
  • I. Feldman et al. Network properties of genes harboring inherited disease mutations. Proc. Natl Acad. Sci. USA (2008)
  • H. Yu et al. The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS Comput. Biol. (2007)