Low-N protein engineering with data-efficient deep learning

Biswas, Surojit; Khimulya, Grigory; Alley, Ethan C.; Esvelt, Kevin M.; Church, George M.

doi:10.1038/s41592-021-01100-y

Article
Published: 07 April 2021

Nature Methods volume 18, pages 389–396 (2021)

Abstract

Protein engineering has enormous academic and industrial potential. However, it is limited by the lack of experimental assays that are consistent with the design goal and sufficiently high throughput to find rare, enhanced variants. Here we introduce a machine learning-guided paradigm that can use as few as 24 functionally assayed mutant sequences to build an accurate virtual fitness landscape and screen ten million sequences via in silico directed evolution. As demonstrated in two dissimilar proteins, GFP from Aequorea victoria (avGFP) and E. coli strain TEM-1 β-lactamase, top candidates from a single round are diverse and as active as engineered mutants obtained from previous high-throughput efforts. By distilling information from natural protein sequence landscapes, our model learns a latent representation of ‘unnaturalness’, which helps to guide search away from nonfunctional sequence neighborhoods. Subsequent low-N supervision then identifies improvements to the activity of interest. In sum, our approach enables efficient use of resource-intensive high-fidelity assays without sacrificing throughput, and helps to accelerate engineered proteins into the fermenter, field and clinic.
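The workflow summarized in the abstract (fit a supervised model on a few dozen assayed mutants, then search sequence space against its predictions) can be illustrated with a minimal sketch. This is not the authors' implementation, which is available in the repositories listed under Code availability. Everything here is an assumption made for illustration: a one-hot encoding stands in for the learned representation, a small ridge regression is the supervised model, the search is a simple Metropolis-style sampler restricted to a fixed number of mutations from wild type, and the ten-residue sequence and "count the tryptophans" assay are invented toy data.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def featurize(seq):
    """Indices of active one-hot features (hypothetical stand-in for a learned embedding)."""
    return [pos * len(ALPHABET) + ALPHABET.index(aa) for pos, aa in enumerate(seq)]


def fit_ridge(seqs, scores, l2=0.1, lr=0.05, steps=300):
    """Fit linear weights by gradient descent on squared error with an L2 penalty."""
    feats = [featurize(s) for s in seqs]
    w = [0.0] * (len(seqs[0]) * len(ALPHABET))
    for _ in range(steps):
        grad = [l2 * wi for wi in w]
        for idx, y in zip(feats, scores):
            err = sum(w[i] for i in idx) - y
            for i in idx:
                grad[i] += err / len(seqs)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def predict(w, seq):
    """Predicted fitness of a sequence under the fitted linear model."""
    return sum(w[i] for i in featurize(seq))


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mutate(seq, rng):
    """Substitute one random position (may redraw the same residue)."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(ALPHABET) + seq[pos + 1:]


def evolve(w, wild_type, steps=500, temperature=0.05, trust_radius=3, seed=0):
    """Metropolis-style search over predicted fitness, kept near wild type."""
    rng = random.Random(seed)
    current = best = wild_type
    for _ in range(steps):
        proposal = mutate(current, rng)
        if hamming(proposal, wild_type) > trust_radius:
            continue  # reject proposals that stray too far from wild type
        delta = predict(w, proposal) - predict(w, current)
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current = proposal
            if predict(w, current) > predict(w, best):
                best = current
    return best


# Toy data: 24 "assayed" mutants of an invented sequence and an invented assay.
wild_type = "ACDEFGHIKL"
rng = random.Random(1)
train_seqs = [mutate(mutate(wild_type, rng), rng) for _ in range(24)]
train_scores = [float(s.count("W")) for s in train_seqs]

weights = fit_ridge(train_seqs, train_scores)
best = evolve(weights, wild_type)
```

In this sketch `best` is the highest-scoring sequence visited under the model, so it is only as good as the model's predictions. The mutation limit (`trust_radius`, a name chosen here) is one simple way to keep the search in the region the training data covers.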

**Fig. 1: UniRep-guided in silico directed evolution for low-N protein engineering.**

**Fig. 2: eUniRep enables low-N engineering of avGFP.**

**Fig. 3: eUniRep enables low-N engineering of the enzyme TEM-1 β-lactamase using only single mutants as training data.**

**Fig. 4: eUniRep designs are structurally non-trivial and require both unsupervised training and low-N supervised training to discover >WT variants.**

Data availability

Data required to reproduce all analyses in this work are provided or can be found at https://github.com/churchlab/low-N-protein-engineering. All referenced PDB structures were obtained from https://www.rcsb.org/. The Sarkisyan dataset was obtained from https://figshare.com/articles/dataset/Local_fitness_landscape_of_the_green_fluorescent_protein/3102154.

Code availability

Code for UniRep model training and inference with trained weights along with links to all necessary data is available at https://github.com/churchlab/UniRep. Code required to reproduce all analyses in this work is provided at https://github.com/churchlab/low-N-protein-engineering.

References

Romero, P. A. & Arnold, F. H. Exploring protein fitness landscapes by directed evolution. Nat. Rev. Mol. Cell Biol. 10, 866–876 (2009).
Packer, M. S. & Liu, D. R. Methods for the directed evolution of proteins. Nat. Rev. Genet. 16, 379–394 (2015).
Lutz, S. & Patrick, W. M. Novel methods for directed evolution of enzymes: quality, not quantity. Curr. Opin. Biotechnol. 15, 291–297 (2004).
Goldsmith, M. & Tawfik, D. S. Directed enzyme evolution: beyond the low-hanging fruit. Curr. Opin. Struct. Biol. 22, 406–412 (2012).
Zhao, H. & Arnold, F. H. Combinatorial protein design: strategies for screening protein libraries. Curr. Opin. Struct. Biol. 7, 480–485 (1997).
You, L. & Arnold, F. H. Directed evolution of subtilisin E in Bacillus subtilis to enhance total activity in aqueous dimethylformamide. Protein Eng. 9, 77–83 (1996).
Lagassé, H. A. D. et al. Recent advances in (therapeutic protein) drug development. F1000Res. 6, 113 (2017).
Marshall, S. A., Lazar, G. A., Chirino, A. J. & Desjarlais, J. R. Rational design and engineering of therapeutic proteins. Drug Discov. Today 8, 212–221 (2003).
Rao, A. G. The outlook for protein engineering in crop improvement. Plant Physiol. 147, 6–12 (2008).
Schmid, A. et al. Industrial biocatalysis today and tomorrow. Nature 409, 258–268 (2001).
Sheldon, R. A. & Pereira, P. C. Biocatalysis engineering: the big picture. Chem. Soc. Rev. 46, 2678–2691 (2017).
Mullard, A. Better screening and disease models needed. Nat. Rev. Drug Discov. 15, 751–769 (2016).
Scannell, J. W. & Bosley, J. When quality beats quantity: decision theory, drug discovery, and the reproducibility crisis. PLoS ONE 11, e0147215 (2016).
Hughes, J. P., Rees, S., Kalindjian, S. B. & Philpott, K. L. Principles of early drug discovery. Br. J. Pharmacol. 162, 1239–1249 (2011).
Scannell, J. W., Blanckley, A., Boldon, H. & Warrington, B. Diagnosing the decline in pharmaceutical R&D efficiency. Nat. Rev. Drug Discov. 11, 191–200 (2012).
Laverty, H. et al. How can we improve our understanding of cardiovascular safety liabilities to develop safer medicines? Br. J. Pharmacol. 163, 675–693 (2011).
Silver, L. L. Challenges of antibacterial discovery. Clin. Microbiol. Rev. 24, 71–109 (2011).
Wu, Z., Jennifer Kan, S. B., Lewis, R. D., Wittmann, B. J. & Arnold, F. H. Machine learning-assisted directed protein evolution with combinatorial libraries. Proc. Natl Acad. Sci. USA 116, 8852–8858 (2019).
Lutz, S. Beyond directed evolution—semi-rational protein engineering and design. Curr. Opin. Biotechnol. 21, 734–743 (2010).
Bedbrook, C. N., Yang, K. K., Rice, A. J., Gradinaru, V. & Arnold, F. H. Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization. PLoS Comput. Biol. 13, e1005786 (2017).
Bedbrook, C. N. et al. Machine learning-guided channelrhodopsin engineering enables minimally invasive optogenetics. Nat. Methods 16, 1176–1184 (2019).
Romney, D. K., Murciano-Calles, J., Wehrmüller, J. E. & Arnold, F. H. Unlocking reactivity of TrpB: a general biocatalytic platform for synthesis of tryptophan analogues. J. Am. Chem. Soc. 139, 10769–10776 (2017).
Silva, D. A., Yu, S., Ulge, U. Y., Spangler, J. B. & Jude, K. M. De novo design of potent and selective mimics of IL-2 and IL-15. Nature 565, 186–191 (2019).
Marcandalli, J., Fiala, B., Ols, S. & Perotti, M. Induction of potent neutralizing antibody responses by a designed protein nanoparticle vaccine for respiratory syncytial virus. Cell 176, 1420–1431 (2019).
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
Halevy, A., Norvig, P. & Pereira, F. The unreasonable effectiveness of data. In IEEE Intelligent Systems (IEEE, 2009).
Hénaff, O. J. et al. Data-efficient image recognition with contrastive predictive coding. In Proc. 37th Int. Conf. Machine Learning 119, 4182–4192 (2020).
Ogden, P. J., Kelsic, E. D., Sinai, S. & Church, G. M. Comprehensive AAV capsid fitness landscape reveals a viral gene and enables machine-guided design. Science 366, 1139–1143 (2019).
Biswas, S. et al. Toward machine-guided design of proteins. Preprint at bioRxiv https://doi.org/10.1101/337154 (2018).
Brookes, D. H., Park, H. & Listgarten, J. Conditioning by adaptive sampling for robust design. Preprint at https://arxiv.org/abs/1901.10060 (2019).
Gupta, A. & Zou, J. Feedback GAN for DNA optimizes protein functions. Nat. Mach. Intell. 1, 105–111 (2019).
Cadet, F., Fontaine, N., Li, G., Sanchis, J. & Chong, M. N. F. A machine learning approach for reliable prediction of amino acid interactions and its application in the directed evolution of enantioselective enzymes. Sci. Rep. 8, 16757 (2018).
Saito, Y., Oikawa, M., Nakazawa, H. & Niide, T. Machine-learning-guided mutagenesis for directed evolution of fluorescent proteins. ACS Synth. Biol. 7, 2014–2022 (2018).
Musdal, Y., Govindarajan, S. & Mannervik, B. Exploring sequence–function space of a poplar glutathione transferase using designed information-rich gene variants. Protein Eng. Des. Sel. 30, 543–549 (2017).
Romero, P. A., Krause, A. & Arnold, F. H. Navigating the protein fitness landscape with Gaussian processes. Proc. Natl Acad. Sci. USA 110, E193–E201 (2013).
Liao, J. et al. Engineering proteinase K using machine learning and synthetic genes. BMC Biotechnol. 7, 16 (2007).
Fox, R. J. et al. Improving catalytic function by ProSAR-driven enzyme evolution. Nat. Biotechnol. 25, 338–344 (2007).
Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
Hopf, T. A., Ingraham, J. B., Poelwijk, F. J. & Schärfe, C. P. I. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).
Sinai, S., Kelsic, E., Church, G. M. & Nowak, M. A. Variational auto-encoding of protein sequences. Preprint at https://arxiv.org/abs/1712.03346 (2017).
Shin, J.-E. et al. Protein design and variant prediction using autoregressive generative models. Preprint at bioRxiv https://doi.org/10.1101/757252 (2019).
Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).
Ashkenazy, H. & Penn, O. FastML: a web server for probabilistic reconstruction of ancestral sequences. Nucleic Acids Res. 40, W580–W584 (2012).
Gumulya, Y. & Gillam, E. M. J. Exploring the past and the future of protein evolution with ancestral sequence reconstruction: the ‘retro’ approach to protein engineering. Biochem. J. 474, 1–19 (2017).
Sternke, M., Tripp, K. W. & Barrick, D. Consensus sequence design as a general strategy to create hyperstable, biologically active proteins. Proc. Natl Acad. Sci. USA 116, 11275–11284 (2019).
Porebski, B. T. & Buckle, A. M. Consensus protein design. Protein Eng. Des. Sel. 29, 245–251 (2016).
Russ, W. P. et al. An evolution-based model for designing chorismate mutase enzymes. Science 369, 440–445 (2020).
Firnberg, E., Labonte, J. W. & Gray, J. J. A comprehensive, high-resolution map of a gene’s fitness landscape. Mol. Biol. Evol. 31, 1581–1592 (2014).
Breen, M. S., Kemena, C., Vlasov, P. K., Notredame, C. & Kondrashov, F. A. Epistasis as the primary factor in molecular evolution. Nature 490, 535–538 (2012).
Povolotskaya, I. S. & Kondrashov, F. A. Sequence space and the ongoing expansion of the protein universe. Nature 465, 922–926 (2010).
Schenk, M. F., Szendro, I. G., Salverda, M. L. M., Krug, J. & de Visser, J. A. G. M. Patterns of epistasis between beneficial mutations in an antibiotic resistance gene. Mol. Biol. Evol. 30, 1779–1787 (2013).
Repecka, D. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat. Mach. Intell. https://doi.org/10.1038/s42256-021-00310-5 (2021).
Manheim, D. & Garrabrant, S. Categorizing variants of Goodhart’s Law. Preprint at https://arxiv.org/abs/1803.04585 (2018).
Dou, J. et al. De novo design of a fluorescence-activating β barrel. Nature 561, 485–491 (2018).
Lu, P., Min, D., DiMaio, F., Wei, K. Y. & Vahey, M. D. Accurate computational design of multipass transmembrane proteins. Science 359, 1042–1046 (2018).
Bick, M. J. et al. Computational design of environmental sensors for the potent opioid fentanyl. eLife 6, e28909 (2017).
Zhang, R. K., Chen, K., Huang, X. & Wohlschlager, L. Enzymatic assembly of carbon–carbon bonds via iron-catalysed sp³ C–H functionalization. Nature 565, 67–72 (2019).
Bornscheuer, U. T. & Pohl, M. Improved biocatalysts by directed evolution and rational protein design. Curr. Opin. Chem. Biol. 5, 137–143 (2001).
Huang, P.-S., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nature 537, 320–327 (2016).
Chen, R. Enzyme engineering: rational redesign versus directed evolution. Trends Biotechnol. 19, 13–14 (2001).
Alford, R. F. et al. The Rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theory Comput. 13, 3031–3048 (2017).
Pédelacq, J.-D., Cabantous, S., Tran, T., Terwilliger, T. C. & Waldo, G. S. Engineering and characterization of a superfolder green fluorescent protein. Nat. Biotechnol. 24, 79–88 (2006).
Dror, A., Shemesh, E. & Dayan, N. Protein engineering by random mutagenesis and structure-guided consensus of Geobacillus stearothermophilus lipase T6 for enhanced stability in methanol. Appl. Environ. Microbiol. 80, 1515–1527 (2014).
Rocklin, G. J., Chidyausiku, T. M., Goreshnik, I. & Ford, A. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science 357, 168–175 (2017).
Wannier, T. M. et al. Monomerization of far-red fluorescent proteins. Proc. Natl Acad. Sci. USA 115, E11294–E11301 (2018).
Xie, Q., Dai, Z., Hovy, E., Luong, M.-T. & Le, Q. V. Unsupervised data augmentation for consistency training. Preprint at https://arxiv.org/abs/1904.12848 (2019).
Berthelot, D. et al. MixMatch: a holistic approach to semi-supervised learning. Preprint at https://arxiv.org/abs/1905.02249 (2019).
Radford, A., Jozefowicz, R. & Sutskever, I. Learning to generate reviews and discovering sentiment. Preprint at https://arxiv.org/abs/1704.01444 (2017).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://arxiv.org/abs/1810.04805 (2018).
Potter, S. C., Luciani, A., Eddy, S. R. & Park, Y. HMMER web server: 2018 update. Nucleic Acids Res. 46, W200–W204 (2018).
Caruana, R., Lawrence, S. & Giles, C. L. Overfitting in neural nets: backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems (NIPS, 2001).
Maclaurin, D., Duvenaud, D. & Adams, R. P. Early stopping is nonparametric variational inference. Preprint at https://arxiv.org/abs/1504.01344 (2015).
Yang, K. K., Wu, Z., Bedbrook, C. N. & Arnold, F. H. Learned protein embeddings for machine learning. Bioinformatics 34, 2642–2648 (2018).
Lambert, T. J. FPbase: a community-editable fluorescent protein database. Nat. Methods 16, 277–278 (2019).
Arnold, F. H. & Georgiou, G. (eds) Directed Evolution Library Creation: Methods and Protocols. (Humana Press, 2010).
van Dijk, D. et al. Recovering gene interactions from single-cell data using data diffusion. Cell 174, 716–729 (2018).
Le, Q. & Mikolov, T. Distributed representations of sentences and documents. In Proc. 31st Int. Conf. Machine Learning 32, 1188–1196 (PMLR, 2014).
Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. Least angle regression. Ann. Stat. 32, 407–499 (2004).
Sohka, T. et al. An externally tunable bacterial band-pass filter. Proc. Natl Acad. Sci. USA 106, 10135–10140 (2009).
Oberacker, P. et al. Bio-On-Magnetic-Beads (BOMB): open platform for high-throughput nucleic acid extraction and manipulation. PLoS Biol. 17, e3000107 (2019).
Kosuri, S. et al. Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proc. Natl Acad. Sci. USA 110, 14024–14029 (2013).
Magoč, T. & Salzberg, S. L. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27, 2957–2963 (2011).
Stiffler, M. A., Hekstra, D. R. & Ranganathan, R. Evolvability as a function of purifying selection in TEM-1 β-lactamase. Cell 160, 882–892 (2015).
AlQuraishi, M. ProteinNet: a standardized data set for machine learning of protein structure. BMC Bioinformatics 20, 311 (2019).
Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers 22, 2577–2637 (1983).
Chen, H. & Zhou, H. X. Prediction of solvent accessibility and sites of deleterious mutations from protein sequence. Nucleic Acids Res. 33, 3193–3199 (2005).

Acknowledgements

We thank M. AlQuraishi, C. Bakerlee, A. Chiappino-Pepe, A. Eremina, K. Fish, S. Gosai, X. Guo, E. Kelsic, S. Kosuri, P. Ogden, S. Sinai, M. Schubert, A. Taylor-Weiner, D. Thompson and A. Tucker for feedback on earlier drafts of this manuscript. We thank members of the Esvelt and Church laboratories for valuable discussion. S.B. was supported by an NSF GRFP Fellowship under grant number DGE1745303. G.K. was supported by a grant from the Center for Effective Altruism. E.C.A. was supported by a scholarship from the Open Philanthropy Project. This material is based upon work supported by the US Department of Energy, Office of Science under award number DE‐FG02‐02ER63445. Computational resources were, in part, generously provided by the AWS Cloud Credits for Research Program and Lambda Labs, Inc.

Author information

These authors contributed equally: Surojit Biswas, Grigory Khimulya, Ethan C. Alley.

Authors and Affiliations

Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, USA
Surojit Biswas & George M. Church
Nabla Bio, Inc., Boston, MA, USA
Surojit Biswas
Telis Bioscience Inc., Boston, MA, USA
Grigory Khimulya
MIT Media Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
Ethan C. Alley & Kevin M. Esvelt
Department of Genetics, Harvard Medical School, Boston, MA, USA
George M. Church

Contributions

S.B., G.K. and E.C.A. conceived the study. S.B. performed wet-lab experiments and managed data. S.B., G.K. and E.C.A. performed machine learning modeling and data analyses. K.M.E. and G.M.C. supervised the project. S.B., G.K. and E.C.A. wrote the manuscript with help from all authors.

Corresponding author

Correspondence to George M. Church.

Ethics declarations

Competing interests

A full list of G.M.C.’s technology transfer, advisory roles and funding sources can be found on the laboratory’s website at http://arep.med.harvard.edu/gmc/tech.html. S.B. is employed by and holds equity in Nabla Bio, Inc. G.K. is employed by and holds equity in Telis Bioscience Inc. E.C.A. and K.M.E. declare no competing interests.

Additional information

Peer review information Nature Methods thanks Gabriel Rocklin, Guillaume Lamoureux, and the other, anonymous reviewer, for their contribution to the peer review of this work. Arunima Singh was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–18

Reporting Summary

About this article

Cite this article

Biswas, S., Khimulya, G., Alley, E.C. et al. Low-N protein engineering with data-efficient deep learning. Nat Methods 18, 389–396 (2021). https://doi.org/10.1038/s41592-021-01100-y

Received: 21 August 2020
Accepted: 22 February 2021
Published: 07 April 2021
Issue Date: April 2021
DOI: https://doi.org/10.1038/s41592-021-01100-y
