Techniques for Developing Reliable Machine Learning Classifiers Applied to Understanding and Predicting Protein:Protein Interaction Hot Spots

Chen, Jiaxing; Kuhn, Leslie A.; Raschka, Sebastian

doi:10.1007/978-1-0716-3441-7_14

Part of the book series: Methods in Molecular Biology (MIMB, volume 2714)

Abstract

With machine learning now transforming the sciences, successful prediction of biological structure or activity is mainly limited by the extent and quality of data available for training, the astute choice of features for prediction, and thorough assessment of the robustness of prediction on a variety of new cases. In this chapter, we address these issues while developing and sharing protocols to build a robust dataset and rigorously compare several predictive classifiers using the open-source Python machine learning library, scikit-learn. We show how to evaluate whether enough data has been used for training and whether the classifier has been overfit to training data. The most telling experiment is 500-fold repartitioning of the training and test sets, followed by prediction, which gives a good indication of whether a classifier performs consistently well on different datasets. An intuitive method is used to quantify which features are most important for correct prediction.
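The repeated repartitioning experiment and the check for sufficient training data can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the chapter's protocol: the data are synthetic (`make_classification`), logistic regression is a placeholder classifier, and the 80/20 split, F1 scoring, and five learning-curve sizes are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedShuffleSplit,
    cross_validate,
    learning_curve,
)

# Synthetic stand-in for a feature table (assumption: not the chapter's data).
X, y = make_classification(
    n_samples=1000,
    n_features=8,
    n_informative=4,
    weights=[0.75, 0.25],  # the minority class plays the role of "hot spot"
    random_state=0,
)

clf = LogisticRegression(max_iter=1000)  # placeholder classifier (assumption)

# 500 independent, stratified train/test partitions of the same dataset.
splitter = StratifiedShuffleSplit(n_splits=500, test_size=0.2, random_state=0)
scores = cross_validate(
    clf, X, y, cv=splitter, scoring="f1", return_train_score=True
)
test_f1 = scores["test_score"]
train_f1 = scores["train_score"]

# A narrow spread suggests consistent performance across partitions;
# a large train-test gap suggests overfitting to the training data.
print(f"test F1 over 500 splits: mean={test_f1.mean():.3f}, std={test_f1.std():.3f}")
print(f"mean train-test F1 gap: {(train_f1 - test_f1).mean():.3f}")

# Learning curve: if the validation score is still rising at the largest
# training size, more data would probably help.
sizes, lc_train, lc_test = learning_curve(
    clf, X, y, train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="f1"
)
for n, s in zip(sizes, lc_test.mean(axis=1)):
    print(f"{n:4d} training samples -> mean validation F1 {s:.3f}")
```

Reporting the standard deviation alongside the mean is the point of the exercise: a single train/test split can look good or bad by chance, whereas the distribution over many partitions shows how much the score depends on which cases happened to land in the test set.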

The resulting well-trained classifier, hotspotter, can robustly predict the small subset of amino acid residues on the surface of a protein that are energetically most important for binding a protein partner: the interaction hot spots. Hotspotter has been trained and tested here on a curated dataset assembled from 1046 non-redundant alanine scanning mutation sites with experimentally measured change in binding free energy values from 97 different protein complexes; this dataset is available to download. The accessible surface area of the wild-type residue at a given site and its degree of evolutionary conservation proved the most important features to identify hot spots. A variant classifier was trained and validated for proteins where only the amino acid sequence is available, augmented by secondary structure assignment. This version of hotspotter requiring fewer features is almost as robust as the structure-based classifier. Application to the ACE2 (angiotensin converting enzyme 2) receptor, which mediates COVID-19 virus entry into human cells, identified the critical hot spot triad of ACE2 residues at the center of the small interface with the CoV-2 spike protein. Hotspotter results can be used to guide the strategic design of protein interfaces and ligands and also to identify likely interfacial residues for protein:protein docking.
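One common way to rank features by their contribution to correct prediction is permutation importance: shuffle one feature column at a time on held-out data and measure how much the score drops. The abstract does not name the method the authors used, so the sketch below is an assumption about one reasonable approach, not a reproduction of theirs. The feature names are hypothetical labels on synthetic data, and the labels are constructed to depend only on the first two columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples = 600

# Hypothetical feature names attached to synthetic columns (assumption).
feature_names = ["accessible_surface_area", "conservation", "noise_1", "noise_2"]
X = rng.normal(size=(n_samples, len(feature_names)))

# By construction, only the first two columns carry signal. This is purely
# illustrative and says nothing about real hot spot energetics.
y = (X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=n_samples) > 0.8).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Shuffle each feature on the held-out set and record the mean accuracy drop.
result = permutation_importance(
    clf, X_test, y_test, n_repeats=20, random_state=0
)

ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(
        f"{feature_names[idx]:<26s} "
        f"{result.importances_mean[idx]:.3f} +/- {result.importances_std[idx]:.3f}"
    )
```

Measuring importance on the test split rather than the training split keeps the ranking tied to generalization: a feature the model has merely memorized will show little drop when shuffled on unseen data.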

Acknowledgments

We sincerely thank Prof. Jerry Tsai for providing a copy of his Binding Interface Database.

Author information

Authors and Affiliations

Bioinformatics and Genomics Graduate Program, Pennsylvania State University, University Park, PA, USA
Jiaxing Chen
Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
Jiaxing Chen, Leslie A. Kuhn & Sebastian Raschka
Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA
Sebastian Raschka

Authors

Jiaxing Chen
Leslie A. Kuhn
Sebastian Raschka

Corresponding author

Correspondence to Leslie A. Kuhn.

Editor information

Editors and Affiliations

Department of Basic and Applied Sciences, Dayananda Sagar University, Bangalore, Karnataka, India
Mohini Gore
Department of Botany, Rajaram College Kolhapur, Kolhapur, Maharashtra, India
Umesh B. Jagtap

About this protocol

Cite this protocol

Chen, J., Kuhn, L.A., Raschka, S. (2024). Techniques for Developing Reliable Machine Learning Classifiers Applied to Understanding and Predicting Protein:Protein Interaction Hot Spots. In: Gore, M., Jagtap, U.B. (eds) Computational Drug Discovery and Design. Methods in Molecular Biology, vol 2714. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-3441-7_14

DOI: https://doi.org/10.1007/978-1-0716-3441-7_14
Published: 08 September 2023
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-3440-0
Online ISBN: 978-1-0716-3441-7
eBook Packages: Springer Protocols