Introduction

Pooled screening with CRISPR technology has revolutionized the ease and scale for probing gene function1,2. Targeted loci, which are often protein-coding genes, can each have hundreds of protospacer adjacent motifs (PAMs), providing many potential single guide RNA (sgRNA) options. It is impractical to assess activity of the millions of potential sgRNAs, so accurate predictions of sgRNA activity are essential for the construction of compact yet potent libraries. Several groups have developed algorithms and web tools to facilitate sgRNA selection3.

We previously used a classification model to determine sequence features and developed on-target sgRNA design rules using data from 1841 sgRNAs (Rule Set 1)4. Rule Set 2 then improved upon this initial attempt by incorporating more training data and using a regression model that included additional sequence features5. Our guide design portal, CRISPick (broad.io/crispick), has been in continuous operation since 2014 and has averaged 168 design runs per day. Since the development of Rule Set 2, new datasets, features, and model architectures have been developed, which we wanted to include in an updated rule set.

While developing this updated rule set, we discovered that small variations in the sequence of the trans-activating CRISPR RNA (tracrRNA) can lead to large differences in activity (here, we will refer to the region of the sgRNA that base pairs to the target DNA as the ‘spacer’ and the remaining structural element of the sgRNA as the ‘tracrRNA’). The majority of published on-target models were trained with sgRNAs that use the tracrRNA as implemented in Hsu et al.6, however, several other tracrRNAs have been used for screening. Chen et al. modified the Hsu tracrRNA with both a flip—a T to A substitution and compensatory A to T substitution, to disrupt the run of 4 thymidines that can trigger RNA polymerase III termination – and an extension of 5 base pairs in the tetra-loop that is hypothesized to stabilize the sgRNA structure7,8. We have also conducted screens with a modification to the Hsu tracrRNA that disrupts the Pol III termination site with a T to G substitution and compensatory A to C substitution, but without any tetra-loop extension (herein named the DeWeirdt tracrRNA)9.

To account for the differential effects of these tracrRNA variants, we incorporated tracrRNA identity as a feature in our rule set. Furthermore, to validate our model and gain a deeper understanding for how the different tracrRNAs affect activity, we generated a new dataset with tens of thousands of sgRNAs tiling across essential and non-essential genes for each tracrRNA variant. We show that our updated model, Rule Set 3 (Sequence + Target), makes optimal predictions for all three tracrRNA variants. Additionally, the new screening data demonstrate that disrupting the Pol III termination signal present in the Hsu tracrRNA improves activity for a subset of spacer RNAs, suggesting that the Chen or DeWeirdt tracrRNA may be preferable when target density is a priority, such as base editing screens10,11,12, or when direct detection of the sgRNA is necessary to interpret screening results, such as in some scRNAseq approaches13. We expect these results to improve CRISPR-Cas9 screening performance in addition to providing mechanistic insight into tracrRNA-dependent differences in sgRNA activity.

Results

Comparison of on-target models

To understand the current state-of-the-art for on-target modeling, we identified four recently published models14,15,16,17, and to evaluate these models we collated datasets generated by three different experimental approaches: three genome-wide datasets, four tiling datasets, and four integrated-target datasets4,5,15,16,18,19,20,21,22,23 (Supplementary Data 1). Due to the strong interdependence between the models and the collated datasets, we compared the models in a pairwise fashion, allowing us to retain the maximum number of spacer sequences for testing while avoiding leakage between training and testing sets (Fig. 1a). Spearman correlation was calculated between the observed and predicted activity to assess performance. By this metric, the best performing model was CRISPRon (Supplementary Fig. 1a).

Fig. 1: Development of Rule Set 3 (Sequence + Target).
figure 1

a Fraction overlap between sgRNAs used for training Rule Set 3 and those used for training other on-target models. Edge width and color are proportional to the fraction overlap. Node size is proportional to the number of sgRNAs. b Schematic depicting Rule Set 3 (Sequence + Target) development. Nucleotide differences in tracrRNA sequences are colored. Models were trained only on the train set, as indicated by blue outlines. Italics indicate features for which information was obtained from existing databases. c SHAP feature importance for the 20 most important features in Rule Set 3 (Sequence). Each point represents one sgRNA from the training set. Descriptions of model features can be found in Supplementary Data 3. d Histograms of SHAP values for sgRNAs, colored by guanine status in the 20th sgRNA position and split by tracrRNA identity. e SHAP feature importance for the 20 most important features in Rule Set 3 (Target). Each point represents one sgRNA from the training set. Descriptions of model features can be found in Supplementary Data 3. Source data are provided as a Source Data file.

We were surprised that Rule Set 2 marginally outperformed VBC Activity (average difference in Spearman correlation = 0.02), since VBC Activity incorporates Rule Set 2 scores in addition to training on data from Munoz et al. The only dataset where VBC Activity outperformed Rule Set 2 was from Behan et al., which, along with the Munoz data, was one of two collated datasets that utilized the tracrRNA from Chen et al. (Supplementary Data 1). This observation led us to hypothesize that there are systematic differences in sgRNA activity that depend on the tracrRNA sequence. To investigate this, we analyzed sgRNA activities from two massive screening efforts, the Dependency Map projects at the Broad and Sanger Institutes24. The Broad dataset was screened with the Avana library using the Hsu tracrRNA, while the Sanger dataset was screened with the Human CRISPR Library v.1.0/1.1 and the Chen tracrRNA. These datasets had 876 spacer sequences in common that target essential genes, enabling an assessment of tracrRNA-dependent effects on sgRNA activity. To understand whether there were predictable differences in activity between the tracrRNA variants, we trained gradient boosting models on five splits of the overlapping sgRNAs and measured the Pearson correlation for predictions on held out folds. The average Pearson correlation between the predicted and observed activity differences was 0.34 (Supplementary Fig. 1b), suggesting that tracrRNA identity should be included as a feature for on-target modeling.

Construction of on-target model

To build an updated on-target model that can make optimal predictions for multiple tracrRNA variants, we selected seven datasets for training. This totaled 46,526 unique context sequences, defined as the 20 nucleotide sequence that matches the spacer plus four nucleotides preceding the spacer and six nucleotides (PAM + 3 nucleotides of context) succeeding the spacer RNA; 45% of sequences utilized the Chen tracrRNA. We also held out six datasets for testing (23,629 unique context sequences; 31% with the Chen tracrRNA) (Supplementary Data 1, Supplementary Data 2). While convolutional neural networks have proven effective for predicting sgRNA activity15,16, we opted for a gradient boosting framework for faster training times25. For each sgRNA we encoded the 30mer context sequence using all the features from Rule Set 2 in addition to features to indicate the longest run of each nucleotide in the sgRNA, the melting temperature of the sgRNA:DNA heteroduplex26, and the minimum free energy of the folded spacer sequence27. We also incorporated categorical variables to indicate which tracrRNA was associated with each spacer, allowing the model to learn features that interact with the tracrRNA (Fig. 1b). We featurized all of the sgRNAs in the training set and fit a gradient boosting regressor to predict z-scored activity values. We refer to this model as Rule Set 3 (Sequence).

Evaluation of on-target scoring criteria

To understand how Rule Set 3 (Sequence) makes its predictions, we calculated Shapley additive explanation (SHAP) values28. We found that a G in the tracrRNA-adjacent 20th position of the spacer sequence was the most important feature for activity as has been observed previously (Fig. 1c)21, although there was a strong interaction with the tracrRNA feature, as sgRNAs with the Chen tracrRNA were less affected by the presence of a G in this position (Fig. 1d). Notably, all three feature classes that were newly added to this model based on prior literature—poly(T), spacer:DNA melting temperature, and minimum free energy – were among the 20 most important features (Fig. 1c, Supplementary Data 3). Likewise, tracrRNA identity also proved to be relevant, validating its inclusion in the model. When we considered the held-out datasets, Rule Set 3 (Sequence) had the highest Spearman correlation on three of the six datasets, including Behan 2019, which used the Chen tracrRNA (Supplementary Fig. 1c). Rule Set 3 (Sequence) predictions were modestly correlated to Rule Set 2 scores for test sgRNAs that used the Hsu tracrRNA (Pearson r = 0.69) (Supplementary Fig. 1d).

Several groups have incorporated information about the protein-coding sequence, such as protein domains, amino acid sequence, evolutionary conservation, and protein length to predict sgRNA activity5,17,29. To test whether these target-site features improve Rule Set 3 (Sequence) scores, we filtered for training data targeting endogenous sites. We calculated the residual activity from Rule Set 3 (Sequence) scores as the outcome variable for the target-site model, Rule Set 3 (Target), ensuring that predictions from the trained model would be additive with Rule Set 3 (Sequence) scores. To test each set of target-site features we split training data into five folds for cross-validation and trained gradient boosting regression models to predict residual activity (Supplementary Data 1). Here, target-site features such as amino acid abundance and conservation refer to features of the protein-coding region of the gene targeted by the sgRNA and we define activity as the likelihood of disrupting protein function. First, we tested whether protein domains are predictive of sgRNA activity. We queried genes in our training set for functional annotations in Ensembl’s REST API30. In total, we obtained functional annotations from 16 sources, including Pfam, Smart, PROSITE, Gene3D, and MobiDB-lite31,32,33,34,35. We then tested if the relative abundance of amino acids around the sgRNA cut site was predictive of sgRNA activity. To determine the optimal amino acid window for generating predictions, we tested widths of 2, 4, 8, 16, and 32 amino acids around the cut site. We saw that a width of 16 amino acids led to the highest Spearman correlation with an average value of 0.19 (Supplementary Fig. 2a). To evaluate the predictive power of evolutionary conservation, we obtained phyloP scores from the UCSC Genome Browser36,37. We tested combinations of small and large nucleotide windows around the cut site with the goal of capturing conservation at the cut site as well as more global features such as functional domains. We averaged conservation across 2, 4, or 8 nucleotides for the small window and 16, 32, or 64 nucleotides for the large window. We found that a small width of 4 and a large width of 32 had the best predictive power with an average Spearman correlation of 0.11 (Supplementary Fig. 2b).

Calculating SHAP values from the trained model, we saw that conservation around the cut site was the most important feature (Fig. 1e), suggesting that targeting a conserved region of a gene improves sgRNA activity. The second most important feature was the proportion of amino acids around the cut site that are typically found in an alpha-helix (V, I, Y, F, W, L). Michlits et al. noted the favorability of each of these amino acids individually17, which we have combined into a single feature. The third most important feature was the relative position of the cut site, where targeting past 85% of the coding sequence led to a steep decrease in sgRNA activity (Fig. 1e), an effect that has been observed previously5. We also examined protein domains38; five protein domain sources were among the 20 most important features (Smart, Pfam, PROSITE profiles, Gene3D, and MobiDB-lite), where sgRNAs targeting within an annotated region were more active, with the exception of MobiDB-lite, which identifies long intrinsically disordered regions. Although the relative abundances of seven different amino acids were among the top 20 features, we were unable to identify a biochemical property that explained their importance (Supplementary Fig. 2c). In fact, the strongest correlate was the fraction of adenine in the codon sequences for an amino acid, potentially indicating that these features were used as a correction to the Rule Set 3 (Sequence) scores after removing a portion of training data that targeted exogenously integrated sites. We tested the target model in combination with the sequence model, which we refer to as Rule Set 3 (Sequence + Target), on held out datasets (Fig. 1b). Target scores improved the Spearman correlation for all test sets relative to sequence scores alone, with an average improvement of 6.7% (Supplementary Fig. 2d).

Validation of model and examination of tracrRNA interaction

To validate Rule Set 3 and gain mechanistic insight into tracrRNA-dependent differences in spacer activity, we designed a tiling library targeting a subset of essential genes39 and non-essential genes40. We generated lentiviral vectors that varied in their use of the Hsu, Chen, or DeWeirdt tracrRNA, and each of the three libraries were screened in triplicate in A375 (melanoma) cells stably expressing Cas9. After three weeks, we collected cells, isolated genomic DNA, retrieved the library by PCR, and performed Illumina sequencing to determine the abundance of each sgRNA (Fig. 2a, Supplementary Data 4).

Fig. 2: Rule Set 3 (Sequence + Target) validation.
figure 2

a Schematic depicting essential/non-essential tiling library construction and screening approach. b Spearman correlations between observed and predicted activity for all essential genes (n = 201) in each of the essential/non-essential screens across previous models and Rule Set 3 models. Rule Set 3 models using the same tracrRNA feature as the screen are highlighted in pink and significantly outperformed other models (one-sided t-test p value < 0.002). Boxes show 25th and 75th percentiles as minima and maxima and the center represents the median; whiskers show the 5th and 95th percentiles. c Percent of all sgRNAs targeting essential and non-essential genes broken down by Rule Set 3 (Sequence + Target) bins for the Hsu and Chen tracrRNAs. d Spearman correlations between predicted scores and the growth phenotype for sgRNAs (n = 1964) targeting essential genes in a tiling CRISPRi dataset across all sequence models. e Percent of quintiles for sgRNAs from essential genes (n = 199) with at least 20 guides binned by Rule Set 3 (Sequence + Target) scores for each screen. f Average log-fold change of 4 sgRNAs per gene for essential genes (n = 201) and non-essential genes (n = 198) calculated by picking sgRNAs randomly, using Rule Set 2 or using Rule Set 3 (Sequence + Target) for the tiling library screened with Hsu, Chen, and DeWeirdt tracrRNAs. Rule Set 3 (Sequence + Target) scores used are with the matched tracrRNA. For the screen performed with the DeWeirdt tracrRNA, we used on-target scores with the Chen tracrRNA. Boxes show 25th and 75th percentiles as minima and maxima and the center represents the median; whiskers show 10th and 90th percentile. Heatmap shows the corresponding SSMD scores. Source data are provided as a Source Data file.

To interpret the results, we first calculated log2-fold-changes (LFCs) compared to the initial library abundance, as determined by sequencing the plasmid DNA (pDNA). As replicates were well correlated (Pearson r = 0.77 − 0.89), we averaged LFCs within each screen (Supplementary Fig. 3a). Average LFCs across screens were also well correlated, with the Chen and DeWeirdt tracrRNAs showing the highest correlation (Pearson r = 0.89; Supplementary Fig. 3b). To compare screening performance across tracrRNAs, we calculated the receiver operating characteristic area under the curve (ROC-AUC) defining sgRNAs targeting essential genes as positive controls and non-essential genes as negative controls. The Chen and DeWeirdt tracrRNAs both achieved ROC-AUCs of 0.82, while the Hsu tracrRNA had an ROC-AUC of 0.76 (Supplementary Fig. 3c). The similarity in performance of the Chen and the DeWeirdt tracrRNAs suggests that the presence or absence of a T in the 5th position has a larger effect on sgRNA activity than the stem extension. We also examined the distribution of various target categories to assess if the controls behaved as expected. In all three screens, the non-targeting controls were tightly distributed and the intergenic controls showed a cutting effect (Supplementary Fig. 3d).

To evaluate the performance of Rule Set 3 as well as other models on this new dataset, we first removed all spacer sequences that had been seen by any of these models, and then within each essential gene we calculated the Spearman correlation between the predicted scores and the observed sgRNA activity. Rule Set 3 (Sequence + Target) significantly outperformed all other models (t-test p value < 0.002) when the correct tracrRNA was specified (Fig. 2b, Supplementary Data 5). We saw better performance when using the Chen tracrRNA as an input to Rule Set 3 when predicting spacers paired with the DeWeirdt tracrRNA, highlighting the importance of disrupting the Pol III termination signal present in the Hsu tracrRNA. Conversely, we saw Rule Set 3 performance decrease when we specified the incorrect tracrRNA. Interestingly, target scores were relatively more helpful than sequence scores for the Chen and DeWeirdt tracrRNAs, although there was large variation across genes, suggesting that sgRNA selection heuristics that over-emphasize target features at the expense of sequence features may lead to poorer performance.

Rule Set 3 (Sequence) had higher Spearman correlations when predicting sgRNAs with the Hsu tracrRNA as opposed to the Chen or DeWeirdt tracrRNAs. This is likely due to a multitude of factors, including a greater diversity of datasets that have the Hsu tracrRNA in the training data, as well as features that are inherently more powerful discriminators for the Hsu tracrRNA. One example of such a feature is having G in the last nucleotide position of the spacer sequence, which has a greater effect on sgRNA activity for the Hsu tracrRNA than the Chen tracrRNA (Fig. 1d). Further, Rule Set 3 (Sequence) predicts more sgRNAs at the extreme ends of the activity spectrum for the Hsu tracrRNA than the Chen tracrRNA (16% and 9% of sgRNAs with |z-score|>1 for the Hsu and Chen tracrRNAs respectively; Fig. 2c, Supplementary Data 5), suggesting the model can identify more discriminating features for the Hsu tracrRNA. We also examined the performance of Rule Set 3 (Sequence) for CRISPRi using a previously published tiling CRISPRi dataset41 with sgRNAs targeting the Hart-Moffat essential gene set to assess on-target activity. We calculated the Spearman correlation of the predicted scores and the measured growth phenotype and saw Rule Set 3 (Sequence) performs substantially better than other models, indicating the robustness of Rule Set 3 for predicting other perturbation modalities (Fig. 2d), although we note that robust CRISPRi predictions should also take into account additional features, such as distance from the transcription start site42,43,44.

Returning to the newly generated tiling dataset, to calibrate our understanding of the model outputs, for each essential gene we divided the observed LFCs for each sgRNA into quintiles. We then compared the observed quintiles with predicted activities and saw a direct relationship between the percent of active sgRNAs and Rule Set 3 (Sequence + Target) scores for all tracrRNAs (Fig. 2e, Supplementary Data 5). For the sgRNAs with the lowest predicted activity, 87.7%, 74.9%, and 82.0%, were observed to be in the lowest or second lowest activity quintile for the Hsu, Chen, and DeWeirdt tracrRNAs respectively. Conversely, for the sgRNAs with the highest predicted activity, 77.0%, 69.0%, and 75.5%, were observed to be in the highest or second highest activity quintile for the Hsu, Chen, and DeWeirdt tracrRNAs, respectively. The large separation between sgRNAs that were observed to be active versus inactive in the highest and lowest predicted bins suggests that sgRNAs picked de novo have a high likelihood of generating gene knockouts for all tracrRNA variants.

To assess how Rule Set 3 (Sequence + Target) impacts library performance in a genome-wide screening context, we simulated a library by picking 4 sgRNAs randomly, using Rule Set 2, or Rule Set 3 (Sequence + Target) with the matched tracrRNA. We observed that across all three screens, as expected, random picking of sgRNAs showed the least separation between average LFCs of essential and non-essential genes (Fig. 2f, Supplementary Data 5). This separation increased when Rule Set 2 or Rule Set 3 (Sequence + Target) was used to pick sgRNAs. We quantitated this separation by calculating the strictly standardized mean difference (SSMD) between the essential and non-essential genes for each model; a higher SSMD indicates greater separation between the essential and non-essential genes. For all three simulated screens, Rule Set 3 (Sequence + Target) had the highest SSMD. When comparing across tracrRNAs, we found that guides screened with the Hsu tracrRNA and picked using Rule Set 3 (Sequence + Target) had the highest overall SSMD, showing that although there are fewer active sgRNAs with the Hsu tracrRNA (Supplementary Fig. 3c), by picking a highly active subset of sgRNAs, one can achieve high screening performance.

To understand how each tracrRNA affects spacer activity, we subtracted the z-scores for sgRNAs screened with the Chen tracrRNA from matched sgRNAs screened with the Hsu tracrRNA from the new tiling library. We then trained a gradient boosting model to predict the activity difference from sequence features. We saw that T in spacer positions 17–20 led to relatively lower activity when paired with the Hsu tracrRNA (Fig. 3a, Supplementary Data 6). On the other hand, G in these same positions led to relatively higher activity for the Hsu tracrRNA. We also observed that spacer RNAs that were predicted to have highly stable secondary structures, as indicated by Gibbs free energy, were less likely to be active when paired with the Hsu tracrRNA.

Fig. 3: Analysis of differences in tracrRNA activity.
figure 3

a SHAP feature importance for the 10 most important features for predicting the difference between sgRNAs screened with the Hsu versus Chen tracrRNA. Each point represents one sgRNA from the training set. Descriptions of model features can be found in Supplementary Data 3. b Box plot of the difference in z-score log-fold changes between sgRNAs screened with the Hsu versus Chen tracrRNA as a function of G/T presence in positions 17–20 of each spacer sequence. The ‘~’ symbol indicates the nucleotide is not present in this range. Box minima and maxima represent the 25th and 75th percentile, whisker minima and maxima represent the 25th and 75th percentile minus and plus 1.5× the interquartile range, respectively, and the box centers represent the 50th percentile of data. c Box plot of the difference in z-score log-fold changes between sgRNAs screened with the Hsu versus Chen tracrRNA as a function of G/T abundance in positions 17-20 of each spacer sequence. Box minima and maxima represent the 25th and 75th percentile, whisker minima and maxima represent the 25th and 75th percentile minus and plus 1.5× the interquartile range respectively, and the box centers represent the 50th percentile of data. d Same as (b) but for the Hsu and DeWeirdt tracrRNAs. e Same as (c) but for the Hsu and DeWeirdt tracrRNAs. f Data from Mimitou et al. Comparison of sgRNA relative abundance when read out via gDNA or direct sequencing. Spacers are binned by G/T abundance in positions 17–20. Source data are provided as a Source Data file.

Arimbasseri and colleagues have shown that a stretch of four or more T’s can terminate Pol III transcription, with longer stretches of T’s having stronger effects on transcription termination45. The Hsu tracrRNA has a run of four T’s in positions 2–5, whereas the Chen tracrRNA only has three T’s in positions 2–4. In this context, the negative impact that 3′-end T’s have on spacers paired with the Hsu tracrRNA suggests that these spacers are more likely to have a premature termination signal when paired with the Hsu tracrRNA than with the Chen tracrRNA. In support of this hypothesis, a recent study from Graf et al. showed diminished Pol III transcription in vitro for two sgRNAs that each had two T’s at the 3′ end of the spacer46. Furthermore, they showed that changing the fourth T in the poly-T run of the Hsu tracrRNA to an A restored sgRNA activity.

To investigate whether G’s at the 3′ end of spacer sequences have an attenuating effect on premature transcription termination, we binned sgRNAs by whether they had a G or a T in the last four nucleotides of the spacer. We saw that spacers that had a T and no G had significantly lower activity with the Hsu tracrRNA than spacers that had both a T and G in this region (mean difference = 0.56, p < 0.01; Fig. 3b, Supplementary Data 6). While spacers with a G and no T were slightly more active than spacers with neither a G nor a T, this difference was smaller (mean difference = 0.16), suggesting that G’s increase the relative activity of spacers paired with the Hsu tracrRNA primarily by inhibiting T-dependent transcription termination signals, as opposed to endowing spacers with some T-independent activity advantage. When we further discretized these bins by the number of T’s and G’s in nucleotides 17–20, we saw that the attenuating behavior of G was sensitive to the number of both T’s and G’s in this region (Fig. 3c, Supplementary Data 6). We recapitulated these observations when taking the difference in spacer activity between the Hsu and DeWeirdt tracrRNAs (Fig. 3d, e, Supplementary Data 6), demonstrating that the observed effects are independent of the stem loop extension present in the Chen tracrRNA.

To solidify the connection between 3′ end spacer G/T content and sgRNA expression we analyzed ECCITE-seq data, which captures sgRNA sequence abundance directly47. In particular, we analyzed 23 sgRNAs that use the Hsu tracrRNA and were sequenced using both gDNA and direct sgRNA sequencing. We calculated log-fold changes between gDNA and direct sequencing and saw that sgRNA levels significantly decreased as the number of Ts at the end of the spacer increased, controlling for G abundance (linear regression coefficient = −0.53, 95% CI = [−0.848, −0.210], p value < 0.01; Fig. 3f, Supplementary Data 6). Conversely, G content tended to enhance transcription, albeit not to a significant level (linear regression coefficient = 0.19; 95% CI = [−0.102, 0.490], p value = 0.19). Thus, direct sgRNA sequencing supports the hypothesis that use of the Hsu tracrRNA leads to reduced sgRNA expression as a function of G and T prevalence at the end of the spacer sequence.

Discussion

Here we present Rule Set 3, an optimal model for predicting sgRNA activity for multiple tracrRNA variants. We validate this model on a dataset tiling essential and non-essential genes. By analyzing the differences in activity across multiple tracrRNA variants, we conclude that early Pol III termination is the primary determinant of activity differences between the Hsu and Chen/DeWeirdt tracrRNAs. That a Pol III dependent feature has such a strong impact on sgRNA activity explains a long-standing observation that in vitro transcribed sgRNAs used especially in zebrafish systems are poorly predicted by models trained on results from mammalian cell models48. Similarly, we expect Rule Set 3 to generalize less well to sgRNAs transcribed from a Pol II promoter. As more CRISPR screening data from diverse contexts become available and as transfer learning approaches from machine learning improve, developing models that generalize to a multitude of screening contexts represents a promising direction for future research.

Methods

Vectors

pLX_311-Cas9 (Addgene 96924): SV40 promoter expresses blasticidin resistance; EF1a promoter expresses SpyoCas9.

All guide vectors are derivatives of the lentiGuide vector, with modifications to the tracrRNA. All guide vectors contain the EF1a promoter and puromycin resistance.

pRDA_118 (Addgene 133459): U6 promoter expresses customizable SpCas9 guide with the DeWeirdt (2020) tracrRNA.

pRDA_651: U6 promoter expresses customizable SpCas9 guide with the Hsu (2013) tracrRNA.

pRDA_652: U6 promoter expresses customizable SpCas9 guide with the Chen (2013) tracrRNA.

Cell lines and culture

A375 cells were obtained from Cancer Cell Line Encyclopedia at the Broad Institute. HEK293Ts were obtained from ATCC (CRL-3216). All cells regularly tested negative for mycoplasma contamination and were maintained in the absence of antibiotics except during screens and lentivirus production, during which media was supplemented with 1% penicillin–streptomycin. Cells were passaged every 2–4 days to maintain exponential growth and were kept in a humidity-controlled 37 °C incubator with 5.0% CO2. Media conditions and doses of polybrene, puromycin, and blasticidin were as follows, unless otherwise noted:

A375: RPMI + 10% fetal bovine serum (FBS); 1 μg/mL; 1 μg/mL; 5 μg/mL

HEK293T: DMEM + 10% heat-inactivated FBS; N/A; N/A; N/A

Essential/non-essential tiling library design

Two hundred one essential and 198 non-essential genes were randomly chosen from the standard set of essential40 and non-essential genes39. All possible sgRNA sequences tiling these genes were designed using CRISPick. The library was filtered to exclude any sgRNAs with BsmBI sites or poly-T sequences. The library was not filtered for promiscuous sgRNAs to enable future efforts focused on off-target analysis using the non-essential genes, but promiscuous guides were excluded from analysis. We also included 1000 controls targeting intergenic sites in the human genome and 1000 non-targeting sgRNAs, resulting in a total library size of 84,609 sgRNAs.

Library production

Oligonucleotide pools were synthesized by CustomArray (GenScript). BsmBI recognition sites were appended to each sgRNA sequence (represented here as the run of 20 Ns) along with the appropriate overhang sequences for cloning into the sgRNA expression plasmids. The final oligonucleotide sequence was thus: CAGCGCCAATGGGCTTTCGACGTCTCACACCGNNNNNNNNNNNNNNNNNNNNGTTTCGAGACGCGACAGGCTCTTAAGCGGCT.

Primers CAGCGCCAATGGGCTTTCGA; CGACAGGCTCTTAAGCGGCT were used to amplify the pool using 25 μL 2× NEBnext PCR master mix (New England Biolabs), 2 μL of oligonucleotide pool (~40 ng), 5 μL of primer mix at a final concentration of 0.5 μM, and 18 μL water. PCR cycling conditions: (1) 98 °C for 30 s; (2) 53 °C for 30 s; (3) 72 °C for 30 s; 24 cycles.

The resulting amplicons were PCR-purified (Qiagen) and cloned into the library vector via Golden Gate cloning with Esp3I (Fisher Scientific) and T7 ligase (Epizyme); the library vector was pre-digested with BsmBI (New England Biolabs). The ligation product was isopropanol precipitated and electroporated into Stbl4 electrocompetent cells (Invitrogen) and grown at 30 °C for 16 h on agar with 100 μg/mL carbenicillin. Colonies were scraped and plasmid DNA (pDNA) was prepared (HiSpeed Plasmid Maxi, Qiagen). To confirm library representation and distribution, the pDNA was sequenced.

Lentivirus production

For pooled library production, 24 h before transfection, 18 × 106 HEK293T cells were seeded in a 175 cm2 tissue culture flask in 25 mL of DMEM + 10% heat-inactivated FBS. Transfection was performed using TransIT-LT1 (Mirus) transfection reagent according to the manufacturer’s protocol. Briefly, one solution of Opti-MEM (Corning, 6 mL) and LT1 (305 μL) was combined with a DNA mixture of the packaging plasmid pCMV_VSVG (Addgene 8454, 5 μg), psPAX2 (Addgene 12260, 50 μg), and 40 μg of the transfer vector (e.g. the library pool). The solutions were incubated at room temperature for 20–30 min, then the transfection mixture was added dropwise to the surface of the HEK293T cells. Flasks were transferred to a 37 °C incubator for 6–8 h, after which the media was removed and replaced with DMEM + 10% FBS media supplemented with 1% BSA. Virus was harvested 36 h after this media change.

Derivation of stable cell lines

In order to establish the Cas9 expressing cell line for screens with the essential/non-essential tiling library, A375 cells were transduced with pLX_311-Cas9 and successfully transduced cells were selected with blasticidin for a minimum of 2 weeks. Cells were removed from blasticidin for at least one passage before transduction with the library.

Pooled screens

For pooled screens, cells were transduced in three biological replicates with the lentiviral library. Transductions were performed at a low multiplicity of infection, using enough cells to achieve a representation of at least 500 transduced cells per sgRNA. Cells were plated in polybrene-containing media with 3 × 106 cells per well in a 12-well plate. Plates were centrifuged for 2 h at 931 × g, after which 2 mL of media was added to each well. Plates were then transferred to an incubator for 12–18 h, after which cells were pooled into flasks. Puromycin was added 2 days post-transduction and maintained for 5 days to ensure complete selection for transduced cells. Upon puromycin removal, cells were passaged every 2–3 days for an additional 2 weeks at a minimum of 500x representation, at which point, 21 days post-transduction, cells were collected for subsequent processing. Cell counts were taken at each passage to monitor growth.

Genomic DNA isolation and sequencing

Genomic DNA (gDNA) was isolated using the KingFisher Flex Purification System with the Mag-Bind® Blood & Tissue DNA HDQ Kit (Omega Bio-Tek). The gDNA concentrations were quantitated by Qubit.

For PCR amplification, gDNA was divided into 100 μL reactions such that each well had at most 10 μg of gDNA. Plasmid DNA (pDNA) was also included at a maximum of 100 pg per well. Per 96-well plate, a master mix consisted of 150 μL DNA Polymerase (Titanium Taq; Takara), 1 mL of 10x buffer, 800 μL of dNTPs (Takara), 50 μL of P5 stagger primer mix (stock at 100 μM concentration), 500 μL of DMSO (if used), and water to bring the final volume to 4 mL. Each well consisted of 50 μL gDNA plus water, 40 μL PCR master mix, and 10 μL of a uniquely barcoded P7 primer (stock at 5 μM concentration). PCR cycling conditions were as follows: (1) 95 °C for 1 min; (2) 94 °C for 30 s; (3) 52.5 °C for 30 s; (4) 72 °C for 30 s; (5) go to (2), ×27; (6) 72 °C for 10 min. PCR primers were synthesized at Integrated DNA Technologies (IDT). PCR products were purified with Agencourt AMPure XP SPRI beads according to manufacturer’s instructions (Beckman Coulter, A63880), using a 1:1 ratio of beads to PCR product. Samples were sequenced on a HiSeq2500 HighOutput (Illumina) with a 5% spike-in of PhiX.

On-target modeling

All read count data were transformed to log-fold changes using the poola package (version 0.0.7; https://github.com/gpp-rnd/poola) in Python (version 3.8). For each screen, we selected sgRNAs that were expected to have a phenotype (e.g. sgRNAs targeting essential genes in a viability screen). We filtered any sgRNA that had more than one perfect match in the coding genome. All activities were transformed using the yeo-johnson transformation from scikit-learn (version 0.24.2) and z-scored. Finally, we changed the sign of all activity measurements so the most active sgRNAs had the most positive activity scores. All processed training and testing data can be found on GitHub: https://github.com/gpp-rnd/rs_dev/tree/main/data/processed.

To build Rule Set 3 (Sequence) we used 46,526 unique context sequences from seven datasets. For each sgRNA we encoded the 30mer context sequence using all the features from Rule Set 2 in addition to features to indicate the longest run of each nucleotide in the sgRNA, the melting temperature of the sgRNA:DNA heteroduplex26, and the minimum free energy of the folded spacer sequence27. We also incorporated categorical variables to indicate which tracrRNA was associated with each spacer, allowing the model to learn features that interact with the tracrRNA. sgRNA features were extracted using the custom Python package sglearn (version 1.2.3; https://github.com/gpp-rnd/sglearn). This package relies on biopython to extract biochemical information about sgRNA sequences49.

To fit an optimal gradient boosting model from sequence features, we used the gradient boosting framework from LightGBM (version 3.2.0)30 and tuned hyperparameters using Tree Structured Parzen Estimators from Optuna (version 2.7.0)50. We tuned the number of leaves (between 8 and 256) and minimum number of samples in a child (between 8 and 256) over 50 hyperparameter iterations. We fixed the learning rate to be 0.01 and used 5000 boosted trees. All other parameters were kept default. To evaluate each set of hyperparameters we split our dataset into five folds using the StratifiedGroupKFold splitter from scikit-learn. We split the data such that all sgRNAs targeting a gene were either in the train set or the test set for each fold. We also tried to represent each dataset source (e.g. Doench 2016 sgRNAs) proportionally in all of the folds, such that each test set had some sgRNAs from each source. We found that a model with a maximum of 111 leaves per base estimator and a minimum of 199 samples per child performed best. We used these hyperparameters to train our final model.

To build the target model, we used Ensembl’s REST API to query the amino acid sequence around the cut site of each sgRNA (accessed August 9, 2021). We used biopython to get biochemical properties of these amino acid sequences (version 1.79). Ensembl’s REST API was also used to obtain protein domain features. We used the UCSC genome browser’s REST API to get PhyloP conservation scores for each sgRNA (accessed August 9, 2021)36,51. All of these features can be generated in Python using the rs3 package (https://github.com/gpp-rnd/rs3). We tuned hyperparameters for the target model using the same pipeline as Rule Set 3 (Sequence). We found that a model with a maximum of 8 leaves per base estimator and a minimum of 137 samples per child performed best. We used these hyperparameters to train our final model.

To rerun the modeling pipeline, reference the github repository here: https://github.com/gpp-rnd/rs_dev. CRISPick has been updated to incorporate Rule Set 3 (Sequence + Target) scores (broad.io/crispick). Feature importances were calculated using the shap package in Python (version 0.39)28.

Screen analysis

Guide sequences were extracted from sequencing reads by running the PoolQ tool with the search prefix “CACCG” (https://portals.broadinstitute.org/gpp/public/software/poolq). Reads were counted by alignment to a reference file of all possible guide RNAs present in the library. Reads were then assigned to a condition (e.g., a well on the PCR plate) on the basis of the 8 nt index included in the P7 primer. Following deconvolution, the resulting matrix of read counts was first normalized to reads per million within each condition by the following formula: read per guide RNA / total reads per condition x 1e6. Reads per million was then log2-transformed by first adding one to all values, which is necessary in order to take the log of sgRNAs with zero reads.

Prior to further analysis, we filtered out 37 sgRNAs for which the log-normalized reads per million of the pDNA was >4 standard deviations from the mean in at least one of the screens. We then calculated the log2-fold-change between conditions. All reported LFC values for dropout screens are relative to pDNA. We assessed the correlation between log2-fold-change values of replicates.

We also filtered out sgRNAs targeting essential and non-essential genes that were included in the training set, which constituted ~6% of all sgRNAs targeting essential genes and ~0.6% of all sgRNAs targeting non-essential genes.

SSMD calculation

The strictly standardized mean difference (SSMD) between sgRNAs targetting essential and non-essential genes was calculated using the following formula: (\(\mu\)ne\(\mu\)e)/\(\sqrt{\left(\right.}\sigma\)ne2 + \(\sigma\)e2), where ne stands for non-essential and e stands for essential.

Data visualization

Figures were created with Python3 and GraphPad Prism (version 9). Schematics were created with BioRender.com.

Statistics and reproducibility

All z-scores, Pearson and Spearman correlation coefficients were calculated in Python. Low abundance sgRNAs in the plasmid pool were excluded from analysis: we z-scored the pDNA lognorms and removed any sgRNAs with a z-score less than −3.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.