Techniques for Developing Reliable Machine Learning Classifiers Applied to Understanding and Predicting Protein:Protein Interaction Hot Spots

  • Protocol
Computational Drug Discovery and Design

Abstract

With machine learning now transforming the sciences, successful prediction of biological structure or activity is mainly limited by the extent and quality of data available for training, the astute choice of features for prediction, and thorough assessment of the robustness of prediction on a variety of new cases. In this chapter, we address these issues while developing and sharing protocols to build a robust dataset and rigorously compare several predictive classifiers using the open-source Python machine learning library, scikit-learn. We show how to evaluate whether enough data has been used for training and whether the classifier has been overfit to training data. The most telling experiment is 500-fold repartitioning of the training and test sets, followed by prediction, which gives a good indication of whether a classifier performs consistently well on different datasets. An intuitive method is used to quantify which features are most important for correct prediction.
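The repeated repartitioning experiment described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the chapter's protocol: the synthetic data from `make_classification`, the random forest settings, the F1 metric, the 80/20 split, and the helper name `repartition_scores` are all assumptions made for the example. The demo call uses 20 repeats to run quickly; passing `n_repeats=500` would match the number of repartitions named in the abstract.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedShuffleSplit


def repartition_scores(clf, X, y, n_repeats=500, test_size=0.2, seed=0):
    """Refit clf on many random stratified train/test partitions.

    Returns two arrays of F1 scores (train, test), one entry per partition.
    (Hypothetical helper; the metric and split ratio are assumptions.)
    """
    splitter = StratifiedShuffleSplit(
        n_splits=n_repeats, test_size=test_size, random_state=seed
    )
    train_scores, test_scores = [], []
    for train_idx, test_idx in splitter.split(X, y):
        clf.fit(X[train_idx], y[train_idx])  # fit() discards any earlier fit
        train_scores.append(f1_score(y[train_idx], clf.predict(X[train_idx])))
        test_scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))
    return np.array(train_scores), np.array(test_scores)


# Synthetic stand-in for a hot spot dataset (NOT the chapter's data):
# an imbalanced binary problem with a minority positive class.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3,
    weights=[0.7, 0.3], random_state=0,
)
clf = RandomForestClassifier(n_estimators=25, max_depth=4, random_state=0)

train_f1, test_f1 = repartition_scores(clf, X, y, n_repeats=20)
print(f"train F1: {train_f1.mean():.3f} +/- {train_f1.std():.3f}")
print(f"test  F1: {test_f1.mean():.3f} +/- {test_f1.std():.3f}")
```

A narrow spread of test scores across partitions suggests that performance does not depend on one fortunate split, while a large gap between the mean training and test scores is a common sign of overfitting.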

The resulting well-trained classifier, hotspotter, can robustly predict the small subset of amino acid residues on the surface of a protein that are energetically most important for binding a protein partner: the interaction hot spots. Hotspotter has been trained and tested here on a curated dataset of 1046 non-redundant alanine scanning mutation sites, each with an experimentally measured change in binding free energy, from 97 different protein complexes; this dataset is available to download. The accessible surface area of the wild-type residue at a given site and its degree of evolutionary conservation proved to be the most important features for identifying hot spots. A variant classifier was trained and validated for proteins where only the amino acid sequence is available, augmented by secondary structure assignment. This version of hotspotter, which requires fewer features, is almost as robust as the structure-based classifier. Application to the ACE2 (angiotensin-converting enzyme 2) receptor, which mediates entry of SARS-CoV-2, the virus that causes COVID-19, into human cells, identified the critical hot spot triad of ACE2 residues at the center of the small interface with the SARS-CoV-2 spike protein. Hotspotter results can be used to guide the strategic design of protein interfaces and ligands and also to identify likely interfacial residues for protein:protein docking.
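The abstract does not say which method was used to rank features, so the sketch below substitutes permutation importance from scikit-learn as one common and easily interpreted choice. It is not necessarily the chapter's method. The data are synthetic, and the feature names (`feature_a`, `feature_b`, `noise`) are hypothetical placeholders that do not correspond to the features used by hotspotter.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): the label depends
# strongly on feature_a, weakly on feature_b, and not at all on noise.
rng = np.random.default_rng(0)
feature_names = ["feature_a", "feature_b", "noise"]
X = rng.normal(size=(600, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and record how much
# the score (accuracy, the classifier's default) drops on average.
result = permutation_importance(
    clf, X_test, y_test, n_repeats=20, random_state=0
)

# Indices of features sorted from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
for i in ranking:
    print(
        f"{feature_names[i]}: "
        f"{result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}"
    )
```

Computing the importances on a held-out set rather than the training set helps ensure that a feature is credited only for improving generalization, not for patterns the classifier memorized.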

Acknowledgments

We sincerely thank Prof. Jerry Tsai for providing a copy of his Binding Interface Database.

Author information

Corresponding author

Correspondence to Leslie A. Kuhn.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Chen, J., Kuhn, L.A., Raschka, S. (2024). Techniques for Developing Reliable Machine Learning Classifiers Applied to Understanding and Predicting Protein:Protein Interaction Hot Spots. In: Gore, M., Jagtap, U.B. (eds) Computational Drug Discovery and Design. Methods in Molecular Biology, vol 2714. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-3441-7_14

  • DOI: https://doi.org/10.1007/978-1-0716-3441-7_14

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-3440-0

  • Online ISBN: 978-1-0716-3441-7

  • eBook Packages: Springer Protocols
