Introduction

Post-translational modifications (PTMs) have long been of interest to proteomics researchers because of their central role in regulating cellular functions. Processes to maximize their recovery run the gamut of proteomics techniques, from sample preparation1 to instrumental acquisition2 and computational analysis3,4,5. At the computational level, proteomics search engines have grown in their capacity to identify PTMs. For PTMs with complex fragmentation patterns like glycosylation that exhibit multiple modes of fragmentation, entire search engines specific to the modification class have been developed4,6,7. Despite this work, many modifications continue to suffer from low recall in standard high-throughput workflows due to their behavior during tandem mass spectrometry (MS) analysis, producing unexpected or difficult fragmentation patterns that frustrate search engines8. Even small changes to workflows—such as the addition of isobaric labels—can alter fragmentation patterns and reduce or preclude identification of even the best-studied PTMs9. Recent work with synthetic peptides carrying less well-studied PTMs demonstrated that many diagnostic ions and neutral losses have yet to be identified10.

With the proliferation of synthetic PTMs11—particularly ones that alter fragmentation patterns9—and new instrumental methods2,12, keeping search engines up to date with knowledge of how an analyte will fragment in a particular setting is a large task. To overcome this, computational tools are being developed to identify modification fragmentation patterns without prior knowledge. The first such tools only identified diagnostic ions and were limited in their applications13, but newer approaches have incorporated additional features. Synthetic peptides bearing modifications are generally seen as gold standards to study PTM fragmentation patterns and methods have been developed to extract them from spectra14, but this approach adds additional benchwork to proteomics experiments. Furthermore, optimal search parameters are fragmentation-dependent and can change based on experimental settings, which requires reprocessing mass spectrometry data and reanalyzing fragmentation patterns for multiple experiment types. Zolg et al. developed a method to do this in a high-throughput manner10, but it requires paired modified-unmodified peptides and cannot be easily reimplemented by other research groups. Their approach to identify neutral losses also requires both the intact and fragmented peak to be present in the spectrum at consistent distances, precluding finding complete losses and many charged losses. Other approaches to score PSMs from modified peptides are trained for specific PTMs15 or perform model refinement that focuses on distances between experimental peaks, discarding information about matched ions from the peptide backbone that would dramatically reduce the required training dataset size16. Chemoproteomics has a particular stake in this effort due to the diversity of probes employed17,18,19. However, existing tools for chemoproteomics require isotopic labeling signatures to be present at the MS1 and MS2 levels20. This limits their applications to chemical probes that are labeled non-isobarically, thus they cannot be used for some PTM probes21, biological PTMs, or the development of isobaric mass tags22. In prior work, we have found that understanding PTM fragmentation patterns allowed us to maximize modified peptide recovery and localization. Thus, when studying a cysteine chemoproteomic probe, we developed a method to extract its diagnostic spectral features to improve coverage of the ligandable cysteineome19. Our approach did not require synthesizing standards or isotopically labeled peptides and facilitated the discovery of partial modification losses and diagnostic ions, ultimately leading to the identification of three diagnostic ions and two partial PTM fragmentation events that escaped manual inspection.

Here, we present an improved, fully automated, and empirically tuned implementation of our diagnostic feature extraction algorithm to study and score the fragmentation patterns of modifications. Our approach detects three separate types of diagnostic features—diagnostic ions, peptide remainder masses, and fragment remainder masses—and can be used in any experimental setting, including for the simultaneous characterization of multiple modifications and when only a handful of PSMs are present for a modification. We demonstrate the robustness of our technique by applying it at both massive and small scale, and across synthetic and biological PTMs. Finally, we perform a meta-analysis of diagnostic features and discuss how these can be used to further PTM discovery in diverse settings. Our method has been implemented within PTM-Shepherd23 and is freely available as part of the FragPipe suite of tools (https://fragpipe.nesvilab.org/).

Results

Algorithm overview

The PTM-Shepherd diagnostic feature mining module aims to perform high throughput identification of spectral features that can be used to identify post-translational modifications (PTMs), facilitating the validation or discovery of PTM-specific signals. Probable modifications from an experiment are identified by passing the results of open or mass offset search to PTM-Shepherd. For each MS1 mass shift, PTM-Shepherd identifies enriched diagnostic features across three categories: diagnostic ions; mass shifts from the unmodified, intact peptide ions (peptide remainder masses); and mass shifts from unmodified fragment ions (fragment remainder masses). This module operates in three steps: calculating all possible spectral features for every peptide-spectrum match (PSM) with a particular mass shift, identifying the most abundant spectral features for every identified mass shift within each category, then finally performing statistical tests and filtering to see whether those features can be used to infer the presence of the modification via comparison to unmodified peptides. This module uses as input decharged and deisotoped MGF spectra produced by MSFragger24, so the maximum charge state for all ions in MS/MS spectra is assumed to be one. Spectral ions are normalized to the base peak and only the top 150 peaks are considered (by default).

We illustrate our technique using a cysteine-specific chemical probe19 that we previously analyzed with an early version of the algorithm, identifying all three ion types (Fig. 1a). The first step in our strategy is to calculate all possible diagnostic spectral features for each PSM within a mass shift identified by PTM-Shepherd. Any ions from experimental spectra that do not belong to the peptide are considered potential diagnostic features for the mass shift. To identify recurring features for the mass shift, calculated features for every spectrum from the mass shift are sent to a common histogram. Peaks are identified from here and shuttled to downstream analysis. For diagnostic ions, the unannotated ions from the experimental spectrum are sent to their histogram as they are (Fig. 1a, green). Peptide remainder masses are calculated by computing mass differences between the theoretical, unshifted peptide ion (purple) and all ions in the spectrum (blue). Fragment remainder masses are calculated by iteratively computing mass differences between every theoretical ion from the peptide backbone (purple) and all ions in the spectrum.

Fig. 1: Implementation of the diagnostic feature mining algorithm.
figure 1

a Calculation of all possible diagnostic ions, peptide remainders, and fragment remainders, with the latter two calculated with respect to theoretical unmodified ions, for a synthetic Cys probe of mass 463 Da. Calculated features across spectra are subsequently pooled to find recurring features. Source data are provided as a Source Data file. b Workflow for diagnostic feature selection.

Finding recurring ions does not mean that they are useful for identifying a mass shift. Our ion set contains features that might be abundant across the entire dataset, so it is necessary to remove baseline noise. We do this by comparing the recovered features from all spectra bearing the mass shift to those of unmodified peptides in bulk as a proxy for dataset background (Fig. 1b). For every feature detected in the prior step, it is quantified across modified and unmodified PSMs, with missing ions or offsets encoded as zeroes. The result is two lists of intensities, from which we can perform statistical tests. Encoding missing ions as zeroes is necessary for this step, but it can also produce a range of non-normal distributions, calling for the non-parametric Mann-Whitney-U test. Features that are significantly different between the modified and unmodified lists are then filtered for sensitivity criteria (minimum prevalence in the modified bin) and mean intensity fold-change between the two bins. Fragment remainder ions undergo an additional layer of filtering for ion formation propensity, where they are required to represent a minimum percentage of the number of ions in their series. True fragment remainder ions can also create “echoes” of their masses that are combinations of the original mass and adjacent amino acids, multiple of which can pass filtering for a mass shift. We correct these by checking for enrichment of adjacent amino acids from the residues the remainder mass is derived from and adjusting the mass accordingly. Because the adjacent residues are pseudo-random in most cases, we also reasoned that any fragment remainder mass less intense than the first corrected mass is likely to be noise. These are also filtered from the result. Additional details about this process can be found in the Methods section.

Re-analyzing the cysteine probe data used for illustration above, we identified all eight diagnostic features—five diagnostic ions (Fig. 1a, green), two peptide remainder masses (Fig. 1a, blue), and one fragment remainder mass (Fig. 1a, red)—that were annotated in the prior study and are high confidence identifications. Furthermore, we also identified two additional diagnostic ions, suggesting improved sensitivity for the empirically tuned and automated algorithm (the full attribute list can be found at Supplementary Table 1).

As has been acknowledged elsewhere20, thorough statistical evaluations of fragmentation pattern detection algorithms are difficult to compute as there are no extant datasets conducive to this analysis. As such, we attempted to evaluate the specificity of the method using a method familiar to proteomics: decoys. We reasoned that datasets with more unique PTMs with uncorrelated diagnostic ions would produce a set of decoy fragmentation patterns that best approximates the feature null distributions. We used a subset of the human proteome published by Wang et al. 25, constructing a shuffled decoy dataset by randomizing the mass shifts of individual PSMs (unshuffled PSM list at Supplementary Data 1a, shuffled PSM list at Supplementary Data 1b). The original and shuffled PSM lists were then searched in parallel. PTM-Shepherd identified 341 diagnostic features from the standard dataset (Supplementary Data 1c), but no features from the shuffled dataset (Supplementary Data 1d), for an \(\widehat{{FDP}}\) of 0. Thus, we demonstrate that our method is robust as well as sensitive.

New protocol characterization

RNA-crosslinking studies also feature labile modifications. Repeating sugar molecules can fragment in myriad ways, frustrating attempts to localize or even identify RNA moieties. Bae et al. recently developed pRBS-ID, an RNA crosslinking workflow utilizing photoactivatable nucleotides and chemical RNA cleavage to overcome these challenges26. Alongside the development of their bench technique, they needed to develop a bespoke computational workflow to identify RNA fragment remainder masses and identify and quantify their host peptides. We believed that this process could be recapitulated by PTM-Shepherd without the need for time-intensive custom workflows, and as such we struck a course to replicate their results for the commonly used 4-thiouridine (4SU) nucleotide analog27.

First, we performed an open search using the default diagnostic ion mining setting available in FragPipe. As expected in any open search, PTM-Shepherd identified many mass shifts for biological and chemical PTMs, but also two unannotated mass shifts of 226 Da and 94 Da at high amounts likely corresponding to the modification of interest (Table 1, full mass shift profile can be found at Supplementary Data 2a). These mass shifts matched those identified by Bae et al. Notably, the fragment remainder masses PTM-Shepherd identified for both mass shifts were nearly identical (Table 2), indicating with a high degree of likelihood that they had the same source. In this case, fragment remainder masses of 94 Da were identified from both mass shifts’ b- and y-ion series, and an additional fragment remainder mass of 77 Da (the prior remainder with a loss of ammonia) was identified from both mass shifts’ b-ions (Supplementary Data 2b). This remainder mass appeared to be diagnostic for RNA-crosslinked peptides, which we further confirmed by performing an additional search incorporating this new fragment remainder mass alongside the 94 Da mass into a labile mode search. Including this feature resulted in an additional 3.8% RNA-crosslinked PSMs over the 94 Da mass alone (Supplementary Fig. 1).

Table 1 Most prevalent mass shifts and their characteristics from an open search of a pRBS-ID experiment
Table 2 Diagnostic features for the most abundant mass shifts detected in an open search of a pRBS-ID experiment

After a more targeted search using these fragment remainder masses, we also wondered whether any additional diagnostic features might appear for the RNA-crosslinked peptides and performed a second pass at diagnostic feature mining (Supplementary Table 2). For fragment remainder masses, we recovered the two masses described above from both b- and y-ions from the intact 226 Da mass shift corresponding to the loss of five-member sugar ring (Fig. 2a) as well as an a-ion associated mass shift at 66 Da (94 Da minus 28 Da) from b-ions. Diagnostic ions can be of particular interest for future analyses, such as in ion-triggered instrument routines, even if they are left unused at the present. We found two easily explicable diagnostic ions for the intact nucleoside (Fig. 2b): an ion at 133 m/z corresponding to a dissociated ribose, the other half of the 94 Da fragment remainder mass, and an associated neutral loss of water. The partial modification loss on fragment ions was also observed from the 94 Da mass shift (Fig. 2c). However, the diagnostic ions were not diagnostic for the MS1 mass shift corresponding the nucleoside without the ribose (Fig. 2d), as with the ribose already dissociated there is nothing left to form the diagnostic ion.

Fig. 2: Characterization of 4SU fragmentation from a pRBS-ID experiment.
figure 2

a Fragment remainder masses for y-ions from the 226 Da mass shift. Remainder masses that passed PTM-Shepherd’s filtering correspond to the retention of the 94 Da fragment on the peptide. b Diagnostic ions derived from the fragmentation of the nucleoside analog from the 226 Da mass shift. c Fragment remainder masses for y-ions from the 94 Da mass shift. d Diagnostic ions from the 94 Da mass shift. Source data are provided as a Source Data file.

Data-driven discovery of diagnostic features

Glycopeptides contain labile modifications that produce rich sequences of diagnostic ions and peptide and fragment remainder masses28. We reasoned that detecting known glycopeptide fragmentation patterns would be a good way to validate our algorithm’s performance given the extensive literature characterizing glycopeptide fragmentation. To this end, we searched for glycopeptides in a large IMAC-enriched, TMT-labeled clear cell Renal Cell Carcinoma (CCRCC) dataset29. Phosphorylation enrichment by IMAC, the method employed in this publication, has been shown to simultaneously enrich glycopeptides, particularly those bearing sialic acids30,31, so the data should be rich in glycan signals. This dataset also presents two challenges: TMT-labeling is known to affect PTM fragmentation patterns due to reduced proton mobility9 and the relatively high collision energies used in this experiment cause extensive fragmentation of glycans, reducing the signal strength of typical glycan fragment ions. Because TMT is searched as fixed or variable modifications (i.e., it does not produce an open search-derived mass shift), peptides containing only TMT and no other mass shift are considered “unmodified”. Effectively, this controls for TMT-related fragmentation when determining glycopeptide fragmentation patterns, as TMT-related fragmentation patterns are present in both the glycopeptide spectra and the unmodified spectra.

We first wanted to verify that we could detect the commonly used glycopeptide-associated diagnostic ions from the MSFragger-Glyco7 search and annotation that are most likely to be present32. After discarding any mass shifts less than 50 Da we were left with 493 likely glycan mass shifts from 967,264 glyco PSMs of 9623 unique glyco peptides, each of which should be enriched for diagnostic ions associated with the N-glycan core structure12 and other monosaccharide(s) present, including sialic acid. Indeed, PTM-Shepherd successfully identifies many of the expected diagnostic ions used in glycopeptide searches and glycan identification7,32, including three known sialic-acid related oxonium ions at 274, 292, and 657 m/z (Fig. 3a, Supplementary Data 3). In addition to these, we found 12 additional ions that were diagnostic for more than 50% of glycan mass shifts. We hypothesized that these might be diagnostic ions specific to a high-collision energy environment and attempted to identify them in a data-driven manner. We used PTM-Shepherd’s diagnostic feature extraction module, which extracts intensities for user-specified ions of interest, to quantify these alongside the set of common diagnostic ions used in the MSFragger-Glyco, identifying clusters of highly correlated ions (Fig. 3b, see Methods). Known ions clustered together meaningfully, with annotated GalNAc, Hexose, HexNAc, and Phospho-Hexose ions being highly correlated with others from the same residue, lending credence to this approach’s validity. Perhaps unsurprisingly given the nature of the enrichment method, most unannotated diagnostic ions formed a large cluster with the two monomeric sialic acid oxonium ions found at 274 and 292 m/z. We selected the diagnostic ions from a subcluster (Fig. 3b, cluster 5) that was highly correlated with both oxonium ions (Supplementary Data 4) to validate individually. These ions formed a potential neutral loss series from the annotated 292 and 274 m/z oxonium ions, with successive losses of 42, 17, 18, and 30 Da. To our knowledge, manuscripts covering sialic acid fragmentation make no mention of these as diagnostic ions12,33,34, so their presence in spectra acquired at high collision energies may be of interest to other researchers when assigning sialic acids to glycan composition.

Fig. 3: Diagnostic features of IMAC-enriched glycopeptides under high energy conditions.
figure 3

a Scatterplot of recovered diagnostic ions across glyco mass shifts. Ions occurring in >50% of mass shifts are considered recurring and are included in b. b Spearman correlation clustering between diagnostic ions across all glyco spectra. Identifiable clusters are as follows: 1: GalNAc, 2: Hex, 3: Phospho-Hexose, 4: NeuAc, 5: sialic acid, 6: HexNAc monomers; 7: HexNAc including non-monomers. c Histogram of peptide remainder ions across glyco mass shifts. d. Histogram of fragment remainder ions across glyco mass shifts. Source data are provided as a Source Data file.

Aside from diagnostic ions, glycopeptides also produce an intense series of peptide remainder ions, called Y-ions in glycopeptide fragmentation nomenclature, where the peptide is intact while the modification has fragmented12. Mammalian N-glycans have a common core structure. When the core structure fragments, it produces a pattern of Y-ions with peptide remainder masses that are identical irrespective of the peptide’s or glycan’s mass and can even be used to diagnose the presence of glycopeptides6. Like the diagnostic ions discussed above, we find an expected pattern of peptide reminder masses corresponding to the N-glycan’s core’s Y-ion series (Fig. 3c). Aside from these, two peptide remainder masses that are not considered in the MSFragger-Glyco search recurred across mass shifts: +83 Da and −17 Da. The smallest glycan mass from the N-glycan core, corresponding to a single GlcNAc retained on the peptide, is 203 Da, so seeing masses smaller than that being as diagnostic for glycopeptides as the complete loss of glycan ( + 0 Da) or a single GlcNAc ( + 203 Da) was unexpected. This pattern—consisting of a cross-ring fragmentation event at the core GlcNAc and a loss of ammonia, respectively—has previously been identified as a conserved fragmentation pattern for glycopeptides35, but appears not to be used in current state-of-the-art tools6,7,36. This indicates that even for very well characterized modifications, gaps can exist between knowledge of fragmentation patterns and their use in computational tools, a disconnect that PTM-Shepherd’s automated fragmentation analysis can correct.

The final diagnostic feature we assessed for this glycan dataset is shifted fragment ion series. When the peptide and glycan have both fragmented, the glycan can leave a signature +203 fragment remainder mass on the peptide ion series12. PTM-Shepherd recovered this fragment remainder mass exactly (Supplementary Fig. 2) and with little interference from artefactual mass shifts despite the noisy nature of pairwise ion differences.

Some of the identified ions, particularly the Y-ion series of peptide remainder masses, appeared to taper off very quickly at larger masses, which is a known issue when identifying labile modifications at relatively high collision energies. We reasoned that using these extra ions in our search when they can be low-abundance or absent injects additional noise into the search results and suppresses real glycopeptide identifications. To test this, we used the fragmentation information provided by PTM-Shepherd and reduced our fragment and peptide remainder masses to only the four Y-ions appearing in >50% of glycan mass shifts. Though more careful analysis would surely yield better results, even the incorporation of a crude cutoff from a subset of the data resulted in a 4.5% increase in glyco-PSMs, proving that the fragmentation information provided by PTM-Shepherd enables researchers to tune search parameters to best suit their individual experiments.

We showed that PTM-Shepherd was sensitive to known diagnostic features for glycopeptides. New features detected by PTM-Shepherd also had chemical meaning relevant to the experimental setting, and PTM-Shepherd was able to identify unannotated sialic acid diagnostic ions for high-energy TMT experiments in a data-driven manner. Additionally, we proved that the information provided by PTM-Shepherd can be incorporated into subsequent searches to fine-tune parameters for different experimental settings.

Automated fragmentation analysis

ADP-ribosylation (ADPR) has seen a surge of interest in recent years, with many enrichment methods37,38 and instrumental techniques39 developed over the last decade to aid in its study. Despite this, specialized computation techniques have lagged behind. Fragmentation studies—necessary to design tools or workflows for the analysis of PTMs—require careful analysis and examination of individual spectra40. We believed that PTM-Shepherd’s diagnostic feature mining module could expedite fragmentation studies and reveal useful insights to their behavior. To demonstrate this, we reanalyzed ADPR-enriched data from Martello et al. 39. from peroxide-treated HeLa cells, rich in Ser-directed ADPR, and mouse liver, rich in Arg-directed ADPR.

To validate the fragmentation patterns we detected, we first cross-checked them against published ones40. As expected, we found previously annotated diagnostic ions (Fig. 4a, Supplementary Data 5a,b) corresponding to almost every expected breakpoint on the ADPR side chain (Fig. 4b). These were all found at relatively high levels among ADPRylated spectra (78–100%). Interestingly, the most intense of these ions—e.g., the adenine-derived ion at 136—was also abundant in unmodified spectra (73%) as a result of co-fragmentation of ADPR-containing peptides (Supplementary Note 1). This speaks to the robustness of PTM-Shepherd’s algorithm; even features whose presence alone is not specific to a particular mass shift can be recovered because our scoring and filtering utilizes intensity information. We also recovered additional ions that correspond to derivatives of annotated ions: an oxidized 428 m/z ion ( + 16 Da), a 348 m/z ion that has undergone a loss of water (−18 Da), and a 250 m/z ion that has undergone a loss of water (−18 Da). These ions were all far more specific to the ADPR PTM than their annotated counterparts and thus may be of interest to others studying ADPR. A final diagnostic ion of interest did not correspond to a common mass offset from an annotated ion. At 137.0458 m/z, we could not identify this ion as being a secondary product of any annotated ions. Its exact mass is strongly suggestive of a deamidation event occurring on the adenine ion at 136.0618 m/z (+0.9840 Da).

Fig. 4: Analysis of ADP-ribosylation fragmentation patterns.
figure 4

a ADP-ribosylation diagnostic ion presence in modified and unmodified spectra. Features detected in both datasets were averaged across all values. b Structure of ADP-ribosylation. Dashed lines correspond to breakpoints in the molecule. c ADP-ribosylation peptide remainder mass presence in modified and unmodified spectra. d Correlation between average intensity of diagnostic ions and their presence in unmodified spectra across both datasets. 95% confidence intervals for regression estimates are highlighted. Source data are provided as a Source Data file.

We also observed a strikingly strong relationship between an ion’s average intensity and its presence in unmodified spectra across both ADPR datasets analyzed (Fig. 4d, Spearman’s R:2 mouse = 0.857; HeLa = 0.884). We have previously commented on this phenomenon when looking at biotin-derived Cysteine probes19. In that case, reducing the isolation window and employing ion mobility gave a modest boost to diagnostic ion specificity, an effect that was presumed to be caused by reduced co-fragmentation of peptides. Additionally, we see a strong relationship between the purity of a given unmodified spectrum and the intensity of PTM-specific diagnostic ions in this dataset (Supplementary Fig. 3), lending further credence to the co-fragmentation hypothesis. It is worth noting that the issue of co-fragmented ions has been well-studied in the context of isobaric tandem mass tags41. But, to our knowledge, this is an understudied issue for biological PTMs.

PTM-Shepherd also identified both types of remainder ions in this dataset, peptide (Fig. 4c) and fragment. Of note was PTM-Shepherd’s recovery of a −42 Da peptide remainder mass from the Arg-directed ADPR dataset (Supplementary Data 5b). When Arg-linked ADPR dissociates from the peptide, it appears to frequently take a portion of the Arg side chain with it. The result is a negative peptide remainder mass corresponding to the loss of the Arg reactive group that is both prevalent (66% of PSMs) and distinguishes ADPR on Arg from other residues. This is also reflected in the fragment remainder masses. The b- and y-ion series were found to consist of 40 and 26% ions shifted by −42, respectively (Supplementary Data 5b). Since only ions downstream of the modification site are expected to be shifted, we only expect to find half of all ions containing PTM-related mass shifts. The abundance of the Arg-specific fragment ions indicates that the modification itself should be easily localizable. We also found a noteworthy number of neutral loss-associated peptide remainder ions. When ADPR fragments after the primary ribose (Fig. 4b, green), we would expect a peptide remainder mass of 114 Da if it were to remain intact. We do not find that mass, but instead find four sequential neutral losses of water from that mass. Equally of interest is that the neutral loss peaks—despite neutral losses not being unique to ADPRylated peptides—are not found in unmodified spectra. Though counterintuitive, even common losses can produce PTM-specific peaks. By thinking of them as losses of almost the entire modification and a common neutral loss, it is easier to reconcile their uniqueness to specific modifications. In other words, a −17 peptide remainder mass (Fig. 4c, red) will appear at the precursor m/z − 17 for unmodified peptides, but at precursor m/z – 558 for modified peptides.

Use cases and applicability of diagnostic features

To investigate the extent to which co-fragmentation affects diagnostic feature characteristics, we leveraged our ability to identify large numbers of glycan diagnostic ions from the CPTAC IMAC-enriched dataset. This dataset represents 117 unique diagnostic ions, each found to be diagnostic for between 1 and 493 mass shifts, for a total of 13707 data points (Supplementary Data 1). Every diagnostic ion was evaluated individually for its ability to separate glyco and unmodified spectra based on its precision and AUC (Supplementary Note 2). This was repeated for the 64 unique peptide remainder masses observed between 1 and 344 times, totaling 2261 data points.

Here, precision can be interpreted as the probability y that a spectrum is a glyco spectrum given that the diagnostic ion is present in the spectrum at intensity x (Fig. 5a). Diagnostic ion precision attenuates rapidly as the intensity increases, losing more than a third of its usefulness when it becomes the spectral base peak (average intensity 100.0). In enriched datasets such as this, that can be explained by co-fragmentation. As mentioned above, for co-fragmented spectra the probability of seeing ions from the minor product is inversely proportional to the spectral purity. For all ions in the spectrum, their probability of passing the limit of detection is proportional to their intensity. Thus, if you have a very intense ion, it can appear in a spectrum even if is coming from a minor product in a relatively high purity spectrum. In enriched datasets such as this, co-fragmented minor peptides are disproportionately likely to be PTM-bearing and produce diagnostic ions. If those ions are also very intense by nature, they will be present in many unmodified spectra. This is a trend that can be reversed by taking intensity information into account rather than only checking for the presence or absence of the ion (Fig. 5b). The AUC statistic here can be directly interpreted as the probability y that a diagnostic ion of intensity x drawn from a random modified PSM will be greater than the intensity of the same diagnostic ion drawn from a random unmodified PSM. After including intensity information, an ion’s ability to separate glyco and non-glyco spectra increases with intensity. Incorporating this feature into PTM-Shepherd allows us to detect diagnostic ions that are as ubiquitous as ADPR’s adenine ion, in 92.9% of off-target spectra in the HeLa dataset (Supplementary Data 5a). It also shows that researchers can effectively use intense diagnostic ions for scoring PTMs, but only if they empirically learn the distribution of intensities among unmodified PSMs beforehand.

Fig. 5: Trends in diagnostic remainder ions.
figure 5

a Relationship between diagnostic ions’ observed intensity and the precision of their presence. b Relationship between diagnostic ions’ observed intensity and the classification strength of their intensity as measured by AUC. Source data are provided as a Source Data file.

For peptide remainder masses, unlike diagnostic ions, precision does not attenuate with intensity (Supplementary Fig. 4). As mentioned above, peptide remainder masses are mass- (although not sequence) specific. Co-fragmented peptides can only share peptide remainder masses if they share a mass that is indistinguishable at MS/MS mass accuracy, which is not guaranteed even for co-fragmented peptides with the same charge state. Excluding noise peaks that happen to fall within the tolerance of a theoretical peptide remainder ion, there should be few erroneously matched peptide remainder masses. The result is a very specific feature that does not attenuate as it gets more intense. Accordingly, peptide remainder ions discovered by PTM-Shepherd have many applications. Experiments performed with data-independent acquisition (DIA) have many co-fragmented peptides by design and present a prime opportunity for their use. Plus, with the advent of real-time searching, peptide remainder ions can also be used for instrumental enrichment42.

Discussion

Our analyses show that PTM-Shepherd can be used to reliably identify diagnostic features for any modification of interest. In high-energy glycopeptide fragmentation, we showed that diagnostic ions for sialic acid could be identified without prior knowledge in a data-driven way, as well as finding two peptide remainder masses that had been described by experimentalists but neglected by cutting-edge glycopeptide search tools. In our discussion of a novel RNA-crosslinking workflow, we showed that we can easily automate experimental characterization in the FragPipe/PTM-Shepherd environment. Finally, our discussion of ADPR fragmentation demonstrated that fragmentation studies—traditionally done by hand with manual annotation of spectra or using custom tools10 and synthetic peptides14—could be automated and democratized to reach a broader audience and study PTMs without additional benchwork. We even found meaningful fragmentation patterns that would have been missed by annotation focused on modification structure alone. Although our analysis focused on demonstrating PTM-Shepherd’s capabilities, we also used our ability to generate diagnostic features in large numbers to better understand their nature. We showed that co-fragmentation of peptides presents a major issue for the precision of diagnostic ions in PTM analysis and explored ways to overcome it, as well as interrogating the utility of peptide and fragment remainder masses.

Automated diagnostic feature detection has wide-ranging applications across proteomics fields. Chemical probes can be characterized instantly, facilitating their development19. It could be advantageous to use these features to develop custom modification scores for localization-by-proxy strategies43 or to perform rescoring in Percolator44. Furthermore, for enriched datasets or DIA-studies, the remainder masses identified by PTM-Shepherd might be the only reliable way to definitively identify labile modifications. There are myriad ways in which understanding modification behavior aids researchers, and thus we believe that the diagnostic feature detection enabled by PTM-Shepherd will be an invaluable tool in the analysis of proteomics data.

Methods

Algorithm overview

Here we provide an overview of the algorithm used for diagnostic feature detection. Additional details are provided in the sections below. The goal for this algorithm is to identify spectral features for the PTMs identified in an analysis. We used open (or mass offset) searches to identify peptides bearing PTMs. PSMs are grouped based on their mass shift, using PTM-Shepherd’s peakpicking algorithm23, into a “mass shift bin.” To do this, MS1 mass shifts for the experiment are allocated into a histogram with 0.0002 Da width bins. The histogram is then smoothed by distributing each bin’s mass into itself and the two adjacent bins on either side using a normal distribution, and peak apexes are detected by identifying regions of the histogram with a prominence greater than 0.3 and the 500 highest signal-to-noise peaks (calculated by summing PSM counts within a 0.004 Da window of the apex and subtracting PSM counts for 0.01 Da windows on either side) are reported. Redundant PSMs are not considered for this analysis, with the highest E-Value PSM for each peptide ion within a mass shift bin being retained as a representative PSM. For each representative PSM within a mass shift bin, we calculate three features: potential diagnostic ions in the spectrum, potential peptide remainder masses in the spectrum, and potential fragment remainder masses in the spectrum (see: Spectral feature calculation). We then check to see which of these features recurs across representative PSMs within the mass shift bin (see: Identifying recurring features). Ideally, one could use these patterns to describe how the PTM in the mass shift bin fragments. However, detecting a fragmentation pattern this way does not mean that it necessarily describes the PTM within the mass shift bin. For example, immonium ions from individual unmodified amino acids will be detected as recurring features but are related to peptide fragmentation rather than PTM fragmentation. Similarly, a-ions will be detected as fragment remainder masses because they produce consistent mass shifts from b-ions. To limit the list of recovered features to only those that describe the relevant PTM, we thus need to see whether they are more abundant within the mass shift bin than in some background dataset. We reasoned that unmodified peptides (contained within the zero bin) are the best representative of a dataset’s background. Following this logic, we then test whether each recurring feature from the mass shift bin is more abundant among representative PSMs in the mass shift bin than the representative PSMs of unmodified peptides (see: Identifying significant features). Features that pass statistical and abundance filtering are reported.

Spectral feature calculation

Rather than using every PSM for what is inherently a noisy process, we select only those that are most likely to have the highest quality spectra. To do this, PSMs within each mass shift bin are first grouped by their peptide ion (sequence, modification state, and precursor charge state), then each group of PSMs has its lowest E-value representative selected for all downstream processing.

The first MS/MS spectral feature we analyze is raw spectral ions, such as immonium and oxonium ions, which we will refer to simply as diagnostic ions. All spectra from PSMs containing a given delta mass are stripped of matched a-ions, b-ions, and y-ions (by default). Spectra are also stripped of a-, b-, and y-ions that are found to be shifted by the PSM’s delta mass, preventing backbone fragments containing the modification from being counted as diagnostic ions. At this point, a spectrum can be thought of as a vector composed of m ions, where each ioni has a corresponding mzi and inti corresponding to the ion’s mass at charge state one and its intensity. All remaining ions are considered potential diagnostic ions and stored in a vector U of length m. This can be represented as

$${{{{{\bf{U}}}}}}=[\left({{{{{{\rm{mz}}}}}}}_{1},\,{{{{{{\rm{int}}}}}}}_{1}\right),\,\cdots,\,\left({{{{{{\rm{mz}}}}}}}_{m},\,{{{{{\rm{in}}}}}}{{{{{{\rm{t}}}}}}}_{m}\right)]$$
(1)

The second MS/MS spectral feature we analyze in the MS/MS spectra is peptide remainder masses. All spectra from PSMs containing a given delta mass are stripped of shifted and unshifted a-, b-, and y-ions, as described above, before precursor remainder mass calculation. A theoretical peptide mass P of charge state one is calculated based on the peptide sequence and variable modifications identified for the PSM during spectral searching but excluding any MS1 mass shift. Then, the pairwise distance d between each remaining ion in the MS/MS spectrum and the theoretical peptide mass P is calculated and stored in a vector V of length m, where m is the number of ions remaining in the spectrum after filtering. Each component Vi contains the pairwise distance between P and mzi as well as the intensity inti. This can be represented as

$${{{{{\bf{V}}}}}}\,=[{{{{{{\bf{V}}}}}}}_{1},\,\cdots,\,{{{{{{\bf{V}}}}}}}_{k}]$$
(2)

with each component

$${{{{{{\bf{V}}}}}}}_{i}=({{{{{{\rm{mz}}}}}}}_{i}-P,\,{{{{{{\rm{int}}}}}}}_{i})$$
(3)

Intuitively, each component can be interpreted as what the precursor remainder mass and intensity would be if the ith ion were a shifted precursor in the spectrum.

The third MS/MS spectral feature we analyze is fragment remainder masses. All spectra from PSMs containing a given delta mass are stripped of unshifted a-, b-, and y-ions only, allowing us to identify instances where the entire delta mass remains on the fragment ions. We reasoned that understanding how modifications affect individual ion series would provide insight into fragmentation patterns, so fragment remainder masses for b- and y-ions are calculated independently. Our procedure is similar to the procedure described by Dancik et al. 45. and reiterated by He et al. 20. For each fragment ion series, the peptide’s theoretical fragment ions of charge state one are calculated based on the peptide sequence and modifications identified for the PSM during spectral searching; the vector F holds each of n theoretical fragment ions, where n is the length of the peptide minus one and Fj corresponds to the jth fragment ion. Then, the pairwise distance between each remaining ion in the MS/MS spectrum and each theoretical fragment ion Fj is calculated and stored in a matrix W of size m by n, where m is the number of ions remaining in the spectrum. This can be represented as

$${{{{{\bf{W}}}}}}=\left[\begin{array}{ccc}{{{{{{\bf{W}}}}}}}_{11} & \cdots & {{{{{{\bf{W}}}}}}}_{1n}\\ \vdots & \ddots & \vdots \\ {{{{{{\bf{W}}}}}}}_{m1} & \ldots & {{{{{{\bf{W}}}}}}}_{{mn}}\end{array}\right]$$
(4)

with each component

$${{{{{{\bf{W}}}}}}}_{{ij}}=({{{{{{\rm{m}}}}}}{{{{{{\rm{z}}}}}}}_{i}-{{{{{\bf{F}}}}}}}_{j},\,{in}{t}_{i})$$
(5)

Intuitively, each matrix component can be interpreted as what the jth fragment’s remainder mass and intensity would be if the ith ion in the spectrum were the jth theoretical fragment’s shifted counterpart.

Identifying recurring features

We then determine which features represent the most intensity and are thus worthy of undergoing testing for enrichment. To do this, we place every value in a histogram with a bin width of 0.2 mDa spanning the range of possible features. Each feature (diagnostic ion, peptide remainder mass, and fragment remainder masses), is given its own histogram. For peptide and fragment remainder masses, the left tail of the histogram is truncated at −250 Da because values smaller than that would necessitate the losses of multiple residues. Because peaks in these histograms are generally too jagged to cleanly identify peaks, they are smoothed. This is done by calculating a rolling mean across the histogram, but one that increases in spread for heavier ions to account for their larger uncertainty in Da terms. The spread of the peak in Da is determined by

$${{{{{\rm{tol}}}}}}*\frac{\left({{{{{\rm{mas}}}}}}{{{{{{\rm{s}}}}}}}_{i}+{{{{{\rm{mas}}}}}}{{{{{{\rm{s}}}}}}}_{x}\right)}{1000000}$$
(6)

where the MS/MS spectrum tolerance is tol in ppm terms, the mass of the ion being inserted is massi, and the average mass of fragment ions or peptide ions from the mass bin is massx. The massx term allows us to properly account for the uncertainty of remainder masses. Distributing the mass without considering the peptide it came from would lead to improperly small spreads. To illustrate the issue, a peptide remainder mass of 100 Da derived from a 1000 Da peptide should have an MS/MS error calculated from 1100 Da, not 100 Da. For diagnostic ions, a mass of 150 Da is used as massx.

Peaks in the histogram are defined by descending each side of a local maximum bin until a bin with either zero intensity is reached or the next bin’s value increases. Manual calibration found that a bin-to-bin tolerance of 1% was enough to prevent noisy bins from splitting peaks in two. Peaks are then integrated by summing histogram bins within the MS/MS tolerance without regard for adjacent peak boundaries. Any peak with an integrated area greater than 0.1% (by default), representing an average intensity greater than 0.1% of the base peak, is selected for downstream analysis. A final check is performed to remove redundant peaks where the least intense of any two histogram peaks that cannot be resolved under the provided MS/MS tolerance is removed.

Identifying significant features

To find features specific to a particular mass shift, the full feature set—every major peak from the feature histograms above—needs to have features pruned from it that are not specific to the mass shift. We reasoned that peptides without mass shifts would be a good representative of a dataset’s noise, and as such testing whether features are more likely to appear among peptides with a particular mass shift than those without any mass shift would filter out non-modification-specific features.

Representative PSMs for every peptide ion with a particular mass shift and representative PSMs with no mass shift are first assembled, then every feature from the list of diagnostic ions, precursor remainder masses, and fragment remainder masses is quantified for each PSM in both lists. For spectra that do not have the diagnostic feature, the intensity is coded as a zero. Fragment remainder masses are likely to appear by chance solely based on the number of theoretical-to-experimental ion offsets calculated, so PSMs are considered to be missing a fragment remainder mass if there are fewer than two shifted ions of the ion type in the spectrum, i.e., fewer than two matching fragment remainder masses within feature matrix W for any ion type. A maximum of 1000 representative PSMs are selected per group, with a seed provided internally for reproducibility.

For every diagnostic feature tested, a series of metrics are produced for filtering noise peaks from real peaks. First, the lists of feature intensities from the unmodified and mass shifted PSMs are compared via a two-sided Mann-Whitney-U test with tie and continuity correction (adapted from the Hipparchus statistics library for Java, v1.8). E-values for each diagnostic feature are calculated by multiplying by the number of tests performed within the feature class for the current mass shift. By default, any feature with an E-value less than 0.05 is filtered out. A second metric to quantitatively assess the strength of the feature is included in PTM-Shepherd’s output: Area Under the Curve (AUC). This is commonly used as a measure of effect size for the Mann-Whitney U test and can be directly interpreted as the probability that a mass shifted PSM will have a higher intensity for this feature than an unmodified PSM. Second, we calculate a feature’s fold change of average intensity across all PSMs. Any features with fold change of less than 3.0 is filtered out by default. This metric primarily helps to identify diagnostic ions and non-specific but increased neutral losses for peptide and fragment remainders. Third, we filter out any features that are not sensitive for the modification, occurring in less than 25% of representative PSMs for diagnostic ions and peptide remainder masses. Owing to the multiple ion requirement for fragment remainder masses, this filter is reduced to 15% but is accompanied by an ion propensity filter requiring at least 12.5% of the identified ions within that series having the mass shift.

Fragment ions undergo an additional post-filtering processing step. Because a theoretical-experimental peak offset Wij is created for n theoretical ions in the theoretical ion series, a single peak in the experimental MS/MS spectrum produces a sequence specific pattern. For example, if the jth residue produces an offset with fragment Fj from the experimental ion i, the same experimental ion responsible for that offset will also produce an “echo” offset from fragment Fj-1 equal to the original offset plus the mass of the residue at position j. Similarly, it will produce an “echo” offset from fragment Fj+1 equal to the mass of residue j + 1 minus the original offset. Depending on the fragment ions containing the mass shift, some modifications can produce very weak signals for their primary mass shift but strong signals from shifted fragment ions upstream or downstream of the modification site. To correct for this, we check for residue enrichment both on and adjacent to the peptide site responsible for producing the mass shift. If any residue is found at position j + 1 for a modification more than 50% of the time, the fragment remainder mass is adjusted by subtracting that residue’s weight from the fragment remainder mass. If any residue is found at position j more than 50% of the time, the mass of residue j is added to the fragment remainder mass. With all fragments downstream of a peptide’s modification site carrying the mass shift, the residues responsible for these shifts should be roughly uniformly distributed across all 20 amino acids. Thus, any mass shift that is less prevalent than one of these adjusted offsets is unlikely to be a real peak, and reporting for fragment remainder masses is truncated after the first adjustment.

Data processing

Four datasets were used throughout this manuscript. The first consists of a single Thermo Fisher Raw file “2021-2-23_EA_296_1A_Final.raw” from ProteomeXchange repository the PXD028853. This contains a cysteine chemoproteomic probe from Yan et al. (2022)19 that was used to demonstrate the algorithm. Data was collected on a Thermo Scientific Orbitrap Eclipse Tribrid in DDA mode and processed directly in FragPipe (v18.0) without conversion to mzML. Data was searched against the Uniprot reviewed protein sequences database retrieved on 13 June 2021 with decoys and common contaminants appended. An offset search was performed in MSFragger (v3.5)5 by loading the “Mass-Offset-Common-PTMs” workflow, replacing the offset list with 0 and 463.236554, and replacing the fixed cysteine carbamidomethylation with a variable one. “Write calibrated MGF”24 was turned on for the PTM-Shepherd23 diagnostic feature mining module, and “Diagnostic Feature Discovery” in PTM-Shepherd (v2.0.0) was enabled with default parameters. Filtering to 1% PSM, peptide, and protein levels was performed by Philosopher (v4.2.2)46.

The second dataset consists of the Clinical Proteomics Tumor Analysis Consortium (CPTAC) IMAC-enriched clear cell renal cell carcinoma (ccRCC) samples29 from the CPTAC data portal47. These 299 files represent TMT-labeled solid tumor or adjacent normal tissue from 110 human ccRCC patients. Samples were acquired on a Thermo Fisher Fusion Lumos in data-dependent acquisition (DDA) mode using high-collision dissociation (HCD). Thermo Fisher raw files were converted to mzML format using Proteowizard v3.0.1139248 with vendor peakpicking enabled. The 23 TMT-plexes were separated into separate experiment folders and processed using FragPipe v18.0. For the primary analysis, the default “glyco-N-TMT” workflow was used with minor changes to account for the goals of the analysis and experimental setup. Data was searched against the Uniprot reviewed protein sequences database retrieved on 13 June 2021 with decoys and common contaminants appended. During the MSFragger5 search, two variable phosphorylation modifications were allowed on the residues STY due to the expected enrichment of phosphorylated peptides and “Write calibrated MGF”24 was turned on for the PTM-Shepherd23 diagnostic feature mining module. In PTM-Shepherd, “Assign Glycans with FDR” was disabled, and “Diagnostic Feature Discovery” was enabled with default parameters. Finally, “Isobaric Labeling-Based Quantification” with TMT-Integrator was disabled. Filtering to 1% PSM, peptide, and protein levels was performed by Philosopher46. PTM-Shepherd was then run via command line to enable the reporting of isotopic peaks.

For the secondary analysis wherein known and discovered diagnostic ions were quantified, PTM-Shepherd’s “Diagnostic Feature Extraction” module was used with the ion list presented in Fig. 2b. This was performed using the mzMLs rather than the deneutrallossed and deisotoped24 mgf files from MSFragger to prevent neutral losses that would be correlated under normal conditions from being anticorrelated in the analysis. For the tertiary analysis wherein the landscape of diagnostic features was explored, PTM-Shepherd was rerun, but with the filtering parameters for diagnostic ions and peptide ions set to 0 for “Min. % of spectra with ion” and 1 for “Min. intensity fold change.” PSM-level correlations for all PSMs with mass shifts >50 Da were computed for each diagnostic ion present in more than 50% of glycan mass shifts. Spectral ions are normalized to the base peak by default, creating nonlinear relationships between some ions and necessitating the use of Spearman’s rank correlation. Correlation was calculated using the Pandas package in Python.

The third dataset consists of a novel protocol for photoactivatable ribonucleoside-crosslinking from the ProteomeXchange repository PXD02340126. Only the two 4SU nucleotide-specific raw files from this repository were used. Samples were acquired on a Thermo Fisher Orbitrap Fusion Lumos using HCD fragmentation. Only the two 4SU-specific raw files from the repository were using in this analysis, and both samples were processing using FragPipe v18.0 directly without conversion to mzML. Samples were processed three times. The first, to find diagnostic features, was a standard open search using the FragPipe default “Open” workflow but with “Write calibrated MGF” and PTM-Shepherd’s “Diagnostic Feature Discovery” enabled with default settings. The second, to validate fragment remainder masses, was adapted from the default “Mass-Offset-CommonPTMs” workflow but with the mass offsets limited to 0, 226.0594, and 94.0168; “Labile modification search mode” enabled; “Y ion masses” and “Diagnostic fragment masses” removed; “Remainder masses” set to 94.0168 and 76.9903; “Write calibrated MGF” enabled; and PTM-Shepherd’s “Diagnostic Feature Discovery” enabled with default settings. The settings for the third analysis to validate an ammonium loss were identical to the second but without the 76.9903 fragment remainder mass. All analyses were run against the Uniprot database described above. Crystal-C49 was used to clean up open search results. Filtering to 1% PSM, peptide, and protein levels was performed by Philosopher46.

The fourth dataset consists of two samples from the ProteomeXchange repository PXD004245 corresponding to ADPR -enriched samples of mouse and HeLa origin39. The former is derived from mouse liver, processed in triplicate, and was acquired on a Thermo Fisher Orbitrap Q-Exactive Plus instrument in DDA mode using HCD. The latter was treated with H2O2 to induce oxidative stress, then collected in the same manner described above. Raw files were converted to mzML using Proteowizard v3.0.19296 with vendor peakpicking enabled. Both datasets were searched against their respective Uniprot reviewed sequence databased with decoys and common contaminants appended, with the mouse database retrieved on 27 September 2021 and the human database described above. Both datasets were searched separately in FragPipe v18.0 using the default “Labile_ADPR-ribosylation workflow with a few changes. During the MSFragger search, “Report mass shift as variable mod” was set to “No” so that PTM-Shepherd would register these ADPRs as mass shifts and “Write calibrated MGF” was enabled for the PTM-Shepherd diagnostic feature mining module. PeptideProphet50 and ProteinProphet defaults for “Offset search” were loaded, then PTM-Shepherd and its “Diagnostic Feature Discovery” Module were enabled.

One additional dataset was used to evaluate PTM-Shepherd specificity. This dataset consists of a subset of the human proteome dataset obtained from PXD01015425. Specifically, the 36 raw files corresponding to the brain tissue labeled “V102” tissue were processed. This was acquired on a Q Exactive Plus in DDA mode using HCD fragmentation. Raw files were searching directly in FragPipe v18.0 as described above using the default “Open search” workflow with PTM-Shepherd’s “Diagnostic Feature Discovery” Module enabled.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.