Journal of Molecular Biology
Volume 428, Issue 4, 22 February 2016, Pages 671-678

Databases/Web Servers
AlloRep: A Repository of Sequence, Structural and Mutagenesis Data for the LacI/GalR Transcription Regulators

https://doi.org/10.1016/j.jmb.2015.09.015

Highlights

  • AlloRep compiles sequence, mutagenesis, and structural data for LacI/GalR proteins.

  • Alignments for > 3000 sequences are grouped by subfamily and sampled in the whole family alignment.

  • AlloRep includes detailed phenotypic and biochemical data on almost 6000 variants.

  • Structural data for 65 proteins are available as residue contact networks.

  • A predicted allosteric position was validated by altering a synthetic repressor.

Abstract

Protein families evolve functional variation by accumulating point mutations at functionally important amino acid positions. Homologs in the LacI/GalR family of transcription regulators have evolved to bind diverse DNA sequences and allosteric regulatory molecules. In addition to playing key roles in bacterial metabolism, these proteins have been widely used as a model family for benchmarking structural and functional prediction algorithms. We have collected manually curated sequence alignments for > 3000 sequences, in vivo phenotypic and biochemical data for > 5750 LacI/GalR mutational variants, and noncovalent residue contact networks for 65 LacI/GalR homolog structures. Using this rich data resource, we compared the noncovalent residue contact networks of the LacI/GalR subfamilies to design and experimentally validate an allosteric mutant of a synthetic LacI/GalR repressor for use in biotechnology. The AlloRep database (freely available at www.AlloRep.org) is a key resource for future evolutionary studies of LacI/GalR homologs and for benchmarking computational predictions of functional change.

Introduction

Sequence- and structure-based comparisons of protein homologs have frequently been used to predict amino acids critical to function. With advances in high-throughput sequence and structure determination, the amount of data available has exploded. To translate these data into meaningful information, myriad computational tools have been developed (i) to detect patterns of amino acid change and (ii) to make predictions about homolog function and mutational outcomes. Development and validation of these programs require experimental datasets against which to test predictions.

One commonly used (e.g., Refs. [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]) dataset comprises in vivo characterization of ~ 4100 mutational variants of the lactose repressor protein (LacI) [13], [14], [15], [16]. In addition, scores of mutational variants for LacI and its paralogs have been the subject of detailed biochemical and expanded phenotypic studies over the last three decades. However, these additional experimental results have been under-utilized by the computational community due to the challenge of curating the relevant information scattered throughout the literature. Nevertheless, these studies provide in-depth insights that would be extremely valuable for assessing computational predictions. Further, structures for numerous LacI/GalR homologs have become available, mainly through the Protein Structure Initiative [17], [18].

Here, we present AlloRep, a repository of published experimental information for homologs of the LacI/GalR family. AlloRep contains (i) manually curated sequence alignments for > 3100 sequences, (ii) experimental results for > 5750 LacI/GalR mutational variants, and (iii) residue–residue contact networks that were derived from 65 crystallographic structures available for full-length homologs and/or their regulatory domains of 17 LacI/GalR subfamilies (Fig. 1). This database can be queried using MySQL. Information contained in the AlloRep database complements information about predicted regulons for > 1300 LacI/GalR homologs, which was recently added to the RegPrecise database [19].

The data in AlloRep also have important applications in protein design: These data can be used to test robustness of protein engineering approaches and to hypothesize novel ideas for engineering synthetic transcription repressors. As proof of principle, we used AlloRep to merge structural, mutational, and sequence data to identify a position that can be mutated to alter allosteric regulation. Most LacI/GalR homologs are allosterically regulated: the DNA binding domains of the apoproteins bind to their cognate DNA sequences with high affinity, and DNA binding is modulated when a distant site on the regulatory domain is occupied by a small molecule effector (or in some cases, a heteroprotein). The LacI/GalR paralogs have evolved specificities for different DNA sequences and allosteric effectors [20]. Although domain recombination shows that the allosteric mechanism may largely be the same, the magnitude and direction of allosteric response can be modulated [20], [21]. In general, predicting the locations of allosteric positions has been challenging. Our prediction was successfully tested in a synthetic, chimeric repressor that was previously constructed from LacI and the cellobiose repressor (CelR).

Overview of the AlloRep database

The AlloRep database comprises 14 tables and can be queried using MySQL. Example queries and a database schema are supplied in the accompanying “Data in Brief” publication [22]. A key advantage of AlloRep is that all entries have been mapped to the analogous position of a single homolog, the full-length Escherichia coli LacI protein. This is a powerful way to compare different homologs, as well as different structural conformations of the same protein. This mapping allows a single query to
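
The value of a shared LacI numbering can be illustrated with a minimal sketch. It is not the AlloRep schema: the table names (`variant`, `contact`), the column names, and every row are hypothetical placeholders, and an in-memory SQLite database stands in for MySQL so that the example is self-contained. The actual tables and example queries are described in the “Data in Brief” publication [22].

```python
import sqlite3

# In-memory SQLite stands in for a MySQL server. All table names, column
# names, and rows below are hypothetical placeholders, not AlloRep content.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variant (
    protein      TEXT,     -- homolog in which the mutation was made
    native_pos   INTEGER,  -- position in that homolog's own numbering
    laci_pos     INTEGER,  -- analogous position in E. coli LacI
    substitution TEXT,
    phenotype    TEXT
);
CREATE TABLE contact (
    structure  TEXT,       -- structure identifier
    laci_pos_a INTEGER,    -- both contact partners in LacI numbering
    laci_pos_b INTEGER
);
""")
conn.executemany(
    "INSERT INTO variant VALUES (?, ?, ?, ?, ?)",
    [
        ("homolog_A", 50, 50, "A", "placeholder_1"),
        ("homolog_B", 47, 50, "G", "placeholder_2"),
        ("homolog_B", 60, 63, "S", "placeholder_3"),
    ],
)
conn.executemany(
    "INSERT INTO contact VALUES (?, ?, ?)",
    [("struct_1", 50, 63), ("struct_2", 12, 50)],
)


def variants_at(conn, laci_pos):
    """Return variants from every homolog that map to one LacI position."""
    return conn.execute(
        "SELECT protein, native_pos, substitution, phenotype "
        "FROM variant WHERE laci_pos = ? "
        "ORDER BY protein, substitution",
        (laci_pos,),
    ).fetchall()


def contact_partners(conn, laci_pos):
    """Return (structure, partner position) pairs for one LacI position."""
    return conn.execute(
        "SELECT structure, laci_pos_b FROM contact WHERE laci_pos_a = ? "
        "UNION "
        "SELECT structure, laci_pos_a FROM contact WHERE laci_pos_b = ? "
        "ORDER BY 1, 2",
        (laci_pos, laci_pos),
    ).fetchall()


print(variants_at(conn, 50))
print(contact_partners(conn, 50))
```

Because both hypothetical tables carry the LacI-equivalent position, one lookup key retrieves variants from different homologs and contacts from different structures, even though the native residue numbers differ.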

Conclusion

The AlloRep database organizes available sequence, structural, and experimental data for the LacI/GalR protein family. This dataset will be useful for the development and validation of computational analyses of protein families. We are committed to the continued integration of mutagenesis, structural, and sequence information as they become available for LacI/GalR homologs. We invite the scientific community to send their mutagenesis data to AlloRep so that this experimental resource remains

Sequence retrieval and alignments

The sequence identity boundaries of the new LacI/GalR subfamilies were defined as described in Ref. [26]. For the subfamilies represented by the new PSI PDB structures, a structure-based reference alignment was constructed with PROMALS3D [46] and integrated into the whole family alignment with the program MARS-Prot [25]. For all new homologs, subfamily alignments were constructed using MUSCLE [47], and representative sequences were integrated into the whole family alignment with MARS-Prot.
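
Subfamily boundaries based on sequence identity rest on a pairwise identity calculation between aligned sequences. The sketch below shows one common way to compute it; it is an illustration only and not necessarily the procedure of Ref. [26]. The function name, the choice to skip any column containing a gap, and the toy sequences are all assumptions.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identity of two aligned sequences of equal length.

    Assumption: columns with a gap in either sequence are skipped, so the
    denominator is the number of ungapped aligned columns. Other
    conventions (e.g., dividing by alignment length) give other values.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x.upper() == y.upper():
            matches += 1
    if compared == 0:
        return 0.0  # no ungapped columns to compare
    return 100.0 * matches / compared


# Toy aligned strings (arbitrary letters, not real LacI/GalR sequences).
print(percent_identity("MKP-VT", "MKPAVS"))
```

In this toy pair, five columns are ungapped and four of them match, so the function reports 80% identity under the stated convention.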

Contact maps

Acknowledgments

This work was supported by the Fundação para a Ciência e Tecnologia grant SFRH/BPD/73058/2010 (F.L.S.), the National Institutes of Health grant GM 079423 (L.S.K.), the University of Kansas Medical Center Biomedical Research Training Program (D.J.P.), the joint National Science Foundation/National Institute of General Medical Sciences Mathematical Biology Program R01GM104974 (M.R.B.), the Robert A. Welch Foundation grant C-1729 (M.R.B.), and private funds. We thank Tina Perica for many

References (50)

  • J. Pei et al., Prediction of functional specificity determinants from protein sequences using log-likelihood ratios, Bioinformatics (2006)

  • K. Bharatham et al., Determinants, discriminants, conserved residues—A heuristic approach to detection of functional divergence in protein families, PLoS ONE (2011)

  • P.V. Mazin et al., An automated stochastic approach to the identification of the protein specificity determinants and functional subfamilies, Algorithms Mol. Biol. (2010)

  • N.J. Marini et al., The use of orthologous sequences to predict the impact of amino acid substitutions on protein function, PLoS Genet. (2010)

  • W. Lee et al., Bi-directional SIFT predicts a subset of activating mutations, PLoS ONE (2009)

  • K. Ye et al., Tracing evolutionary pressure, Bioinformatics (2008)

  • G.M. Cooper et al., Qualifying the relationship between sequence conservation and molecular function, Genome Res. (2008)

  • K. Ye et al., Multi-RELIEF: A method to recognize specificity determining residues from multiple sequence alignments using a machine-learning approach for feature weighting, Bioinformatics (2008)

  • C.J. Needham et al., Predicting the effect of missense mutations on protein function: Analysis with Bayesian networks, BMC Bioinformatics (2006)

  • E.A. Stone et al., Physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity, Genome Res. (2005)

  • V.G. Krishnan et al., A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function, Bioinformatics (2003)

  • P.C. Ng et al., Predicting deleterious amino acid substitutions, Genome Res. (2001)

  • D. Lee et al., 1,000 structures and more from the MCSG, BMC Struct. Biol. (2011)

  • K. Khafizov et al., Trends in structural coverage of the protein universe and the impact of the Protein Structure Initiative, Proc. Natl. Acad. Sci. U. S. A. (2014)

  • D.A. Ravcheev et al., Comparative genomics and evolution of regulons of the LacI-family transcription factors, Front. Microbiol. (2014)

Present address: D. J. Parente and J. A. Hessman, University of Kansas School of Medicine, Kansas City, KS 66160, USA.