  • Review Article
Prediction of peptide mass spectral libraries with machine learning

Abstract

The recent development of machine learning methods to identify peptides in complex mass spectrometric data constitutes a major breakthrough in proteomics. Longstanding methods for peptide identification, such as search engines and experimental spectral libraries, are being superseded by deep learning models that allow the fragmentation spectra of peptides to be predicted from their amino acid sequence. These new approaches, including recurrent neural networks and convolutional neural networks, use predicted in silico spectral libraries rather than experimental libraries to achieve higher sensitivity and/or specificity in the analysis of proteomics data. Machine learning is galvanizing applications that involve large search spaces, such as immunopeptidomics and proteogenomics. Current challenges in the field include the prediction of spectra for peptides with post-translational modifications and for cross-linked pairs of peptides. Permeation of machine-learning-based spectral prediction into search engines and spectrum-centric data-independent acquisition workflows for diverse peptide classes and measurement conditions will continue to push sensitivity and dynamic range in proteomics applications in the coming years.
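To make the notion of an in silico library entry concrete, the following Python sketch computes the singly charged b- and y-ion m/z values for an unmodified peptide from its amino acid sequence. It is a minimal illustration and not the method of any tool discussed in this Review. The function name `fragment_mz` and the dictionary layout are hypothetical, and the residue masses are standard monoisotopic values rounded to five decimals. A trained model would additionally supply a predicted intensity for each of these fragments, which is not attempted here.

```python
# Standard monoisotopic residue masses in daltons (rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton
WATER = 18.010565   # monoisotopic mass of H2O


def fragment_mz(peptide):
    """Return singly charged b- and y-ion m/z values for an unmodified peptide.

    Hypothetical helper for illustration: b[i] holds b_(i+1) and y[i] holds
    y_(i+1), so each list has len(peptide) - 1 entries.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]

    # b ions: N-terminal prefix masses plus one proton.
    b_ions = []
    prefix = 0.0
    for mass in masses[:-1]:
        prefix += mass
        b_ions.append(prefix + PROTON)

    # y ions: C-terminal suffix masses plus water plus one proton.
    y_ions = []
    suffix = 0.0
    for mass in reversed(masses[1:]):
        suffix += mass
        y_ions.append(suffix + WATER + PROTON)

    return {"b": b_ions, "y": y_ions}


if __name__ == "__main__":
    ions = fragment_mz("PEPTIDE")
    for series in ("b", "y"):
        for index, mz in enumerate(ions[series], start=1):
            print(f"{series}{index}: {mz:.4f}")
```

In a predicted library, each such m/z value would be paired with a model-derived intensity, and the resulting spectrum would be compared with measured spectra using a similarity score.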

Fig. 1: Fragmentation spectra in shotgun proteomics.
Fig. 2: DDA and DIA.
Fig. 3: Machine learning.
Fig. 4: Machine learning strategies for ion series intensity prediction.
Fig. 5: DDA applications.
Fig. 6: DIA applications.

Acknowledgements

I thank G. Borner, B. Frohn, T. Geiger, J. L. Restrepo-López, F. Traube and S. Yilmaz for critical reading of the manuscript; C. De Nart for assistance with Fig. 1 and P. Sinitcyn for the re-analysis of data in Fig. 6a. This project was partially funded by the German Ministry for Science and Education (BMBF) funding action MSCoreSys, reference number FKZ 031L0214D.

Author information

Corresponding author

Correspondence to Jürgen Cox.

Ethics declarations

Competing interests

The author declares no competing interests.

About this article

Cite this article

Cox, J. Prediction of peptide mass spectral libraries with machine learning. Nat Biotechnol 41, 33–43 (2023). https://doi.org/10.1038/s41587-022-01424-w

  • DOI: https://doi.org/10.1038/s41587-022-01424-w
