1. Introduction
Advances in biomarker assessment, companion diagnostics and genomics have revolutionised the way breast cancer is currently classified and managed [1,2]. The immune microenvironment of solid tumours, including breast cancer, plays a pivotal role in tumour development and progression [3,4,5]. Cancer cells can evade the regulatory pathways of programmed death-1 (PD-1) and its ligand (PD-L1), thus overcoming the cytotoxic effect of T cells. Immune checkpoint blockade using anti-PD-L1 inhibitors has been investigated in various trials in lung cancer, melanoma and, more recently, breast cancer, with confirmed efficacy [6,7,8]. This has led to the approval of immune modulators for the treatment of PD-L1-positive breast cancer, which is currently being incorporated into various guidelines [9]. The first-approved and most established immune checkpoint inhibitor in breast cancer is atezolizumab, for which a companion diagnostic assay (the VENTANA SP142) is required to select patients eligible for this drug.
The limited data available in the literature on non-breast cancers suggest poor reproducibility of PD-L1 SP142 scoring [10]. Some studies have compared the performance of various PD-L1 assays [11,12], and only a few have analysed pathologist concordance in the scoring of breast cancer [13,14,15]. Those latter studies were small and heterogeneous, with some including training sets [14]. Furthermore, the nature of discordant cases was not analysed, nor was there an assessment of intra-observer agreement or of the effect of the pathologist’s experience. In addition, all previous studies focused on TNBC; therefore, information on pathologist concordance in the scoring of PD-L1 in HER2-positive and/or luminal breast cancer does not exist. Emerging data suggest cross-talk between HER2 and PD-L1 and potentially support the use of immunotherapy in HER2-positive breast cancer [16]. PD-L1 expression is correlated with the response to neoadjuvant chemotherapy in HER2-positive breast cancer [17].
We therefore aimed to assess the inter- and intra-observer concordance of breast pathologists of varying expertise and from different geographical locations in reporting a large cohort of PD-L1 SP142-stained invasive breast carcinomas of various molecular subtypes, and to determine whether particular molecular subtypes are more or less prone to poor inter-observer concordance. We also sought to analyse discordant cases in detail to gain insight into the reasons for discrepancies in PD-L1 results, allowing for a subsequent search for strategies to tackle them.
2. Materials and Methods
Core biopsies from a total of 100 cases of primary breast cancers were included in the study. Cases were selected retrospectively from the files of a single large UK institution (Queen Elizabeth Hospital Birmingham) to include all molecular subtypes with enrichment for the TNBC group.
Sections of 4 μm were cut from formalin-fixed, paraffin-embedded tumour blocks and stained using the VENTANA SP142 anti-PD-L1 rabbit monoclonal primary antibody on a VENTANA BenchMark ULTRA automated staining platform, according to the manufacturer’s protocol. A section from a cell block containing three cell lines with various staining intensities and a section of normal tonsil were included as on-slide controls. Paired H&E sections and PD-L1-stained immunohistochemistry slides were digitally scanned using a Leica Aperio AT2 slide scanner at ×40 and uploaded to the University of Birmingham digital platform via a secure link (https://eslidepath.bham.ac.uk, last accessed 23 February 2023). Each participant was provided with a unique username and password to access the digital platform for whole slide scoring. Twelve pathologists from eight institutions representing three European countries (United Kingdom, Republic of Ireland, Belgium) evaluated all cases in round one, of whom 10 re-scored the same cases in round two after a washout period of at least 3 months, designed to assess intra-observer variability. All pathologists had previously received Roche training for SP142 PD-L1 scoring in TNBC and passed a proficiency test.
PD-L1 SP142 immune cell (IC) scoring was conducted according to the recommended scoring algorithm [18], using a cut-off value of ≥1% to indicate positivity. In addition, the pathologists were asked to provide the percentage of immune cells with positive staining for each case, including those cases scored as negative. All scorers completed a survey assessing their experience in breast pathology reporting as well as their training in and real-life reporting of PD-L1.
Statistical Analysis
The data were tabulated and statistically analysed using SPSS (IBM) software version 28. We used standard statistical analyses for assessing intra-/inter-rater concordance/agreement, which have been previously described [19]. The intraclass correlation coefficient (ICC), a descriptive statistic assessing the consistency or reproducibility of quantitative measurements made by different observers measuring the same quantity, was used (on median percentage scores) to determine whether subjects/items can be rated reliably by different raters. The value of an ICC can range from 0 to 1, with 0 indicating no reliability among raters and 1 indicating perfect reliability. ICC results are interpreted as follows: values < 0.5 indicate poor reliability, values from 0.5 to 0.75 indicate moderate reliability, values from 0.75 to 0.9 indicate good reliability and values greater than 0.9 indicate excellent reliability [20]. In our study, we used a two-way random-effects model, testing both the consistency and the absolute agreement relationships, with the mean of ratings as the unit of measurement.
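As a minimal sketch (not the SPSS routine used in this study), the two-way random, absolute-agreement, average-measures ICC, i.e., ICC(2,k), can be computed from the ANOVA mean squares of a subjects × raters score matrix; the `scores` array below is illustrative, not study data.

```python
import numpy as np

def icc2k(scores: np.ndarray) -> float:
    """ICC(2,k): two-way random effects, absolute agreement, average measures.

    scores: (n_subjects, k_raters) matrix of percentage scores.
    """
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)  # per-subject means
    col_means = scores.mean(axis=0)  # per-rater means

    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ((scores - grand) ** 2).sum() - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)             # between-subjects mean square
    ms_cols = ss_cols / (k - 1)             # between-raters mean square
    ms_err = ss_err / ((n - 1) * (k - 1))   # residual mean square

    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)

# Illustrative: a second rater scoring systematically 1% higher keeps
# perfect consistency but lowers absolute agreement below 1.
scores = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]])
print(round(icc2k(scores), 3))
```

Consistency-type ICCs ignore the rater column effect, which is why systematic rater offsets penalise absolute agreement but not consistency.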
Fleiss’ multiple-rater kappa statistics of inter-observer and intra-observer agreement for designating cases as PD-L1-positive versus -negative using a cut-off value of 1% were calculated. Fleiss’ kappa (κ) measures the level of agreement between two or more raters when the response variable is measured on a categorical scale. Kappa results are interpreted as follows: values ≤ 0 indicate no agreement, values from 0.01 to 0.20 indicate none to slight agreement, values from 0.21 to 0.40 indicate fair agreement, values from 0.41 to 0.60 indicate moderate agreement, values from 0.61 to 0.80 indicate substantial agreement and values from 0.81 to 1.00 indicate almost perfect agreement.
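For illustration, Fleiss’ kappa for the binary positive/negative call can be computed from a case × category count matrix, each row summing to the number of raters; this is a generic sketch with made-up votes, not the software routine used in the study.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for N cases rated by n raters into c categories.

    counts: (N, c) matrix; counts[i, j] = raters assigning case i to category j.
    Assumes the same number of raters per case.
    """
    n_cases, _ = counts.shape
    n_raters = counts[0].sum()
    # Per-case observed agreement.
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from overall category proportions.
    p_j = counts.sum(axis=0) / (n_cases * n_raters)
    p_e = (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

# Illustrative binary (positive/negative) votes: 4 hypothetical cases, 3 raters.
votes = np.array([[3, 0], [0, 3], [2, 1], [3, 0]])
print(round(fleiss_kappa(votes), 3))  # 0.625
```

Note that kappa corrects observed agreement for the agreement expected by chance, so it can fall below zero when raters agree less often than chance would predict.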
A case was regarded as PD-L1 positive or -negative if more than 50% of the participants designated it as positive or negative, respectively. The consensus score was considered a majority score if 67% or more of the participants agreed on the categorisation. If all participants agreed (100%), this was regarded as absolute agreement (AA). The cases with agreement less than 67% and above 50% were considered challenging. In cases of no agreement (50% or less), a case was considered as PD-L1-positive or -negative based on the consensus of the experienced pathologists only.
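The categorisation rules above can be sketched as a small helper; the thresholds mirror the definitions in the text, and the example votes are hypothetical.

```python
def categorise_case(votes: list[bool]) -> str:
    """Classify a case from raters' binary PD-L1 calls (True = positive).

    100% agreement is absolute agreement (AA); >=67% is a majority score;
    between 50% and 67% is challenging; <=50% is no agreement (in the study,
    such cases were resolved by the consensus of experienced pathologists).
    """
    n_pos = sum(votes)
    agreement = max(n_pos, len(votes) - n_pos) / len(votes)  # winning fraction
    if agreement == 1.0:
        return "absolute agreement"
    if agreement >= 0.67:
        return "majority score"
    if agreement > 0.50:
        return "challenging"
    return "no agreement"

print(categorise_case([True] * 12))               # absolute agreement
print(categorise_case([True] * 9 + [False] * 3))  # majority score (75%)
print(categorise_case([True] * 7 + [False] * 5))  # challenging (~58%)
print(categorise_case([True] * 6 + [False] * 6))  # no agreement (50%)
```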
Scatter plots were used to visualise percentage PD-L1 scores, and the strength of the relationship between scores was expressed as a squared correlation coefficient (R2). All analyses were supervised by an expert in pathology informatics (PL).
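The R² reported for the scatter plots is simply the squared Pearson correlation between two sets of percentage scores; a minimal sketch follows, with hypothetical scores from two raters.

```python
import numpy as np

def r_squared(x, y) -> float:
    """Squared Pearson correlation between two raters' percentage scores."""
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2

# Hypothetical percentage IC scores from two raters on six cases.
rater_a = [0, 1, 2, 5, 10, 40]
rater_b = [0, 1, 3, 5, 12, 38]
print(round(r_squared(rater_a, rater_b), 3))
```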
An outline of the study methodology is shown in Figure 1.
4. Discussion
We present comprehensive data of a large PD-L1 concordance cohort, scored twice by pathologists from eight institutions, representing three countries. Our data show reassuring inter- and intra-observer agreements, which were the highest among experts, and highlight cancers with low levels of PD-L1 expression as the most challenging in classifying as either PD-L1-positive or -negative.
Unlike standard diagnostic and prognostic markers for breast cancer, SP142 PD-L1 immunohistochemistry is assessed in the immune micro-environment of breast cancer and not in the neoplastic cells themselves. PD-L1 expression in foci of ductal carcinoma in situ (DCIS), necrotic debris, normal mammary tissue and normal nodal tissue is excluded. Therefore, experience in both tumour morphology and PD-L1 assessment is required and may affect the reproducibility of scoring.
Few studies, summarised in Table 9, have addressed the consistency of PD-L1 reporting among pathologists. A prospective multi-institutional study showed poor reproducibility of PD-L1 scoring, with pathologists disagreeing on the classification of cases as PD-L1-positive or -negative in over half of the scored cases, and with complete agreement on SP142 scoring in only 38% of cases [21]. In a cohort of 426 tumours of Chinese women, the concordance between two pathologists in PD-L1 scoring was 78.2% (kappa value of 0.567) and 61.4% in primary tumours and nodal metastases, respectively, indicating moderate agreement [22].
Using “Observers Needed to Evaluate Subjective Tests” (ONEST), Reisenbichler et al. [21] reported a decreased overall percentage agreement with an increasing number of pathologists assessing each case, with the lowest concordance at eight pathologists or more. Another study of 79 PD-L1 SP142-stained breast cancers scored by experienced breast pathologists at the Memorial Sloan Kettering Cancer Center revealed strong agreement [23]. Our data, based on a larger cohort of TNBC cases, confirm the substantial agreement and show that concordance was higher among experts than among those with no experience in reporting PD-L1. More importantly, the agreement among experts was substantial to perfect in challenging cases, and experts showed a much higher consistency in reporting challenging, low-expressing TNBC, a finding that is relevant to clinical practice. This is in accordance with findings for other biomarkers [24] and reflects the importance of testing at regional institutions with quality-assured protocols and experienced scorers, as well as the value of discussing/referring difficult/equivocal cases to expert pathologists for their opinions.
While several antibodies/assays for PD-L1 assessment are available (e.g., 22C3, 28-8, SP142, SP263 and 73-10), the VENTANA Roche SP142 assay is the only FDA- and CE-IVD (European Commission in vitro diagnostics)-approved companion test for atezolizumab therapy. An expert round table in 2019 [25] recommended the assay as the only approved companion diagnostic for selecting patients for immunotherapy and recommended using primary tumour samples, where available, over metastases for assessment. In the UK, atezolizumab plus chemotherapy, with its companion diagnostic assay, was granted approval by the National Institute for Health and Care Excellence (NICE) for the treatment of locally advanced/metastatic PD-L1-positive TNBC. More recently, pembrolizumab plus chemotherapy has been approved for the same indication in PD-L1-positive TNBC using the companion diagnostic Agilent 22C3 assay.
In this study, we assessed both the inter- and intra-observer concordance among the participating pathologists. Notably, the intra-observer concordance was high (0.667 to 0.956) among both expert and non-expert pathologists in PD-L1 scoring, indicating that pathologists are likely to adhere to their own scoring parameters. When the median percentage of PD-L1 expression was compared among the raters, the highest ICC (0.974) was achieved among experienced raters in the second round. We observed the lowest concordance value of 0.619 when comparing non-experienced to experienced scorers. Similarly, a higher concordance among those experienced in PD-L1 scoring (93.3%) compared with non-experts (81.5%) was previously reported by Pang et al. [26].
While, overall, there was high concordance among pathologists in PD-L1 SP142 scoring, some cases were challenging to score. These cases comprised 6–8% of all cases and generally showed very low levels of expression spanning the threshold for positivity. They may represent a so-called “borderline category”, where expression cannot readily be designated a clear-cut positive or negative status. Ideally, information on the tumour response to immunotherapy should determine how these cases are classified. It is of interest that expert pathologists, who routinely reported PD-L1 in breast cancer, showed substantial concordance in scoring these difficult cases. We therefore recommend that cases of very low expression (i.e., close to the 1% cut-off value) be scored by an expert pathologist, either via double-reporting or via a second opinion referral.
Table 9. Summary of studies evaluating the SP142 PD-L1 concordance of scoring.
Reference | Number of Cases (Type) | Clone(s) | SP142 Scoring Method | Scorers | Inter-Observer Agreement | Intra-Observer Agreement |
---|---|---|---|---|---|---|
Downes et al. 2020 [19] | 30 surgical excisions TMAs | 22C3, SP142, E1L3N | IC ≥ 1% | 3 pathologists | Kappa for IC1%: 0.668 | 1 month washout period. Kappa = 0.798 |
Noske et al. [13] | 30 (resections) | SP263, SP142, 22C3, 28–8 | IC ≥ 1% | 7 trained + one Ventana SP142 expert for SP142 only | ICC for SP142: 0.805 (0.710–0.887) | Not tested |
Dennis et al. (abstract) [14] | 28 test sets through the Roche International Training Programme | SP142 | IC ≥ 1% | 432 (trained multiple institutions), from several countries | OPA: was 98.2%, with PPA of 99.4% and NPA of 96.6%. | Not tested |
Hoda et al. [23] | 75 (cores and excision), primary and metastases | SP142 | IC ≥ 1% | 8 experienced (single institution) | Kappa 0.727 | Not tested |
Reisenbichler et al. 2021 [21] | 68 cases for SP142 and 67 cases for SP263 | SP142, SP263 | IC ≥ 1% & % expression for cases scored as positive only | 19 randomly selected pathologists from 14 US institutions; breast pathologists, with few non-breast pathologists. Experience in reporting PD-L1 not stated | Complete agreement for SP142 categorisation into positive vs. negative in 38%. Agreement decreased with the increasing number of scorers, reaching a low plateau of 0.41 at eight scorers or more | Not tested |
Pang et al. [26] | 60 TNBC TMAs | VENTANA SP142, DAKO 22C3 | IC ≥ 1% | 10 pathologists, including 5 who were PD-L1-naïve and 5 who had passed a proficiency test | 93.3% for experts; 81.5% for non-experts. | Tested after a 1 h training video and an overnight washout period. OPA increased from 81.5% to 85.7% for non-experts after video training. OPA was 96.3% for experts. |
Van Bockstal et al. 2021 [15] | 49 metastatic TNBC (biopsies and resections) | VENTANA SP142 | IC ≥ 1% | 10 pathologists; all passed a proficiency test | Substantial variability at the individual patient level. In 20% of cases, chance of allocation to treatment was random, with a 50–50 split among pathologists in designating as PD-L1-positive or -negative | Not tested |
Ahn et al. 2021 [27] | 30 surgical excisions | SP142, SP263, 22C3 and E1L3N | ICs and TCs were scored in both continuous scores (0–100%) and five categorical scores (<1%, 1–4%, 5–9%, 10–49% and ≥50%). | 10 pathologists with no special training, of whom 6 underwent Ventana Roche training | 80.7% inter-observer agreement at a 1% cut-off value | Proportion of cases with identical scoring at a 1% IC cut-off value increased from 40% to 70.0% after training |
Abreu et al. 2022 (Conference abstract) [28] | 168 in tissue microarrays | 22C3 and SP142 | Not stated | 4 pathologists including 2 breast pathologists and 2 surgical pathologists with no specific PD-L1 training | Overall concordance for SP142 was 64.8%; overall κ = 0.331, with κ = 0.420 for breast pathologists and κ = 0.285 for general pathologists | Not tested |
Chen et al. 2022 [22] | 426 primary and metastatic surgical excisions | SP142 | IC ≥ 1% | Two experienced pathologists | 78.2% concordance; κ = 0.567 | Not tested |
Current study | 100 (cores), primary breast cancer | SP142 | IC ≥ 1% & % expression for all cases; two rounds of scoring separated by a 3-month washout period | 12 experienced breast pathologists from 8 institutions in the UK, Ireland and Belgium. All passed a proficiency test. | Absolute agreement in 52% and 60% of cases in the first and second rounds, with Kappa values of 0.654 and 0.655, respectively, indicating substantial agreement. Higher concordance among experts, particularly in TNBC and challenging cases. | Tested after a 3-month washout period. Almost perfect agreement regardless of pathologists’ PD-L1 experience |
Similar challenges in PD-L1 scoring have been highlighted in carcinomas of other tissues. For example, the concordance between the assays used for PD-L1 assessment in head and neck squamous cell carcinoma (HNSCC) was fair to moderate, with a tendency for the SP142 assay to better stain the immune cells [29]. Furthermore, using three PD-L1 tests on HNSCC tissue microarrays (standard SP263, standard 22C3 and in-house-developed 22C3), significant differences were found among the three tests using clinically relevant cut-off values, i.e., ≥20 and ≥50%, for the combined positive score (CPS) and tumour proportion score (TPS). Intra-tumour heterogeneity was generally higher when CPS was used [30]. On the other hand, Cerbelli et al. showed a high concordance between the 22C3 pharmDx assay and the SP263 assay on 43 whole sections of HNSCC [31]. These data collectively highlight the challenges in PD-L1 assessment in various cancers, including differences in results between the available antibody clones and staining platforms.
Our data also confirm previous studies showing the highest proportion of PD-L1 positivity in TNBC [32]. PD-L1 was previously shown to be associated with higher tumour grades and higher pCR rates. Low levels of expression were associated with shorter recurrence-free survival (RFS), including following subtype adjustment [32].
The current study and previous lessons from the IMpassion trial [33] shed some light on issues related to the immunohistochemical assessment of PD-L1 in breast cancer tissue. The strengths of the study include the large cohort of cases, the inclusion of 12 pathologists from three countries, the inclusion of both expert and non-expert assessors, the robust design, with the assessment of inter- and intra-observer concordance in two rounds, and the detailed statistical analysis. The digital analysis of whole slide images, rather than scoring glass slides, may be a weakness for pathologists who are not used to digital reporting. More recently, the use of digital image analysis algorithms and/or artificial intelligence (AI) has been proposed for PD-L1 scoring in various solid tumours [34]. Going forward, this is an exciting and promising endeavour that requires thorough validation against gold-standard pathologist scoring before implementation, as well as the determination of whether such algorithms are superior to manual scoring in identifying responders to immune therapy. Currently, AI-based PD-L1 scoring in breast cancer is limited to research studies and has not been validated for routine clinical use.
5. Conclusions
In summary, we present a detailed analysis of 12 pathologists who scored 100 digitally scanned breast cancer slides for PD-L1 using the VENTANA SP142 assay in two rounds separated by a washout period. Absolute (100%) agreement was achieved in 52% and 60% of cases in the first and second rounds, respectively, with Kappa values of 0.654 and 0.655, indicating substantial agreement. We provide reassuring evidence of a high concordance of PD-L1 reporting among pathologists, the highest being among experts and in reporting challenging, low-expressing TNBC. The intra-observer agreement was substantial for all raters. Despite experience and adherence to current reporting guidelines, there remains a minority of tumours (6–8%) that are challenging to assign to either a positive or negative category. These are PD-L1 low-expressing and/or heterogeneous tumours that suffer from the least concordance among pathologists. Consensus scoring and referrals for expert opinions should be considered in those cases. If uncertainty persists, this should be recognised and clearly communicated to clinicians in the context of a multidisciplinary approach. For inconclusive cases, testing on another tumour sample and/or using another assay (e.g., the DAKO 22C3 assay for selecting patients for pembrolizumab therapy) could be performed.
Pathologists’ training and experience are paramount in evaluating PD-L1 expression and selecting patients for immune checkpoint anti-PD-L1 inhibitors. Further work on refining the criteria for scoring, pathologists’ training and assessing pathologist concordance is needed. This will ensure the accurate classification of tumours into a positive or negative category and, hence, the accurate selection of patients for atezolizumab therapy.
This study also shows that digital pathology is a useful tool that allows for the instantaneous sharing of high-quality whole slide scans with colleagues. This is particularly helpful for consensus scoring and/or seeking expert opinions.