Abstract
Aims To study programmed death ligand 1 (PD-L1) expression and tumour infiltrating lymphocytes (TILs) in patients with early-stage non-small cell lung carcinoma (NSCLC) with artificial intelligence (AI) algorithms.
Methods The study included samples from 50 early-stage NSCLCs. PD-L1 immunohistochemistry (IHC) stained slides (clone SP263) were scored manually and with two different AI tools (PathAI and Navify Digital Pathology) by three pathologists. TILs were digitally assessed on H&E and CD8 IHC stained sections with two different algorithms (PathAI and Navify Digital Pathology, respectively). The agreement between observers and methods for each biomarker was analysed. For PD-L1, the turn-around time (TAT) for manual versus AI-assisted scoring was recorded.
Results Agreement was higher in tumours with low PD-L1 expression regardless of the approach. Both AI-powered tools identified a significantly higher number of cases equal to or above 1% PD-L1 tumour proportion score as compared with manual scoring (p=0.00015), a finding with potential therapeutic implications. Regarding TAT, there were significant differences between manual scoring and AI use (p value <0.0001 for all comparisons). The total TILs density with the PathAI algorithm and the total density of CD8+ cells with the Navify Digital Pathology software were significantly correlated (τ=0.49 (95% CI 0.37, 0.61), p value <0.0001).
Conclusions This preliminary study supports the use of AI algorithms for the scoring of PD-L1 and TILs in patients with NSCLC.
- Artificial Intelligence
- Lung Neoplasms
- Biomarkers, Tumor
Data availability statement
All data relevant to the study are included in the article or uploaded as supplementary information. Not applicable.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
WHAT IS ALREADY KNOWN ON THIS TOPIC
Manual programmed death ligand 1 (PD-L1) quantification is very time-consuming and subject to significant interobserver variability. Artificial intelligence (AI)-powered PD-L1 immunohistochemistry (IHC) assessment algorithms offer many advantages but most of them lack generalisability. Finally, although tumour infiltrating lymphocytes (TILs) scoring is becoming an established prognostic/predictive parameter across different tumour types, implementation in patients with non-small cell lung carcinoma (NSCLC) has remained elusive.
WHAT THIS STUDY ADDS
We wanted to align AI-based PD-L1 and TILs assessment for their potential prognostic and combinatorial predictive role in this population.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
This study offers pragmatic insights for the use of AI-based PD-L1 and TILs scoring in patients with NSCLC.
Introduction
Few areas in cancer research have been as contentious as the role of predictive biomarkers for selection of immune checkpoint inhibitor (ICI) therapies.1–3 Clinical trials in patients with advanced non-small cell lung carcinoma (NSCLC) have identified an association between overall survival and the expression of programmed death ligand 1 (PD-L1) in tumour cells based on immunohistochemistry (IHC) (ie, tumour proportion score or TPS).4 Accordingly, in the past few years there has been continuous controversy about biological heterogeneity, the different cut-off points, the choice of clones or the use of validated assays versus laboratory developed tests (LDTs).1 5 6 The recently released guideline on the topic, conducted by several professional organisations, strongly recommends the use of a PD-L1 IHC assay.4 If an LDT is chosen it needs to be validated according to the requirements of the corresponding accredited body.4 Interestingly, there was a conditional recommendation regarding the reporting of PD-L1 IHC results using a percentage expression score.4 Despite the success of targeted therapies and immune checkpoint blockade in the advanced setting, the 5-year survival of operable NSCLC patients is only 13%–60% for stages II and III,7 so the synergy of chemotherapy, targeted therapies and ICI therapies has rapidly expanded in the adjuvant and neoadjuvant settings.8–12 One of the main knowledge gaps is how to improve patient selection based on sensitive, fast, reproducible and cost-effective predictive biomarkers.13 14
Beyond PD-L1 IHC, there is increasing evidence that composite testing strategies might be more informative for the prediction of ICI therapies in patients with NSCLC.14–25 Early systematic evaluation of multiomics association to ICI response identified abundance of tumour infiltrating lymphocytes (TILs), specifically CD8+ T cells, as the most predictive variable across different cancer types.23 Unsurprisingly, the specific characteristics, density and/or spatial layout of CD8+ T cells are emerging as one of the most promising predictive approaches.16 19 25–36 Therefore, the increasing use of artificial intelligence (AI) algorithms for the evaluation of immune markers is an interesting opportunity in this space.13 37–43 In fact, a survey across stakeholder groups identified ‘prediction of treatment response directly from routine pathology slides’ as the most likely future use of AI in digital pathology.44 Specifically, quantification of PD-L1 and counting lymphocytes were two of the only five AI tools that another global Delphi study forecasted as certain within the next decade.45 Therefore, we hypothesised that scoring PD-L1 IHC with AI algorithms is faster, more sensitive and more reproducible than manual interpretation in patients with early-stage NSCLC. Moreover, we wanted to align PD-L1 expression with AI-based TIL assessment tools for the potential prognostic and combinatorial predictive role of TILs in this population.
Material and methods
Study design
A total of 50 consecutive specimens of early-stage NSCLC (40 adenocarcinomas, 9 squamous cell carcinomas and 1 adenosquamous carcinoma) were considered. The material available for all tumours was formalin-fixed and paraffin-embedded (FFPE). Automated IHC for PD-L1 and CD8 expression was performed for all cases on a VENTANA BenchMark ULTRA staining instrument (Roche Diagnostics). FFPE tumour tissues were sectioned at a thickness of 4 µm and stained with H&E and with two monoclonal rabbit antibodies, following the manufacturer’s recommendations: anti-PD-L1 clone SP263 (ready to use, VENTANA PD-L1 SP263 assay) and anti-CD8 clone SP57 (ready to use, CONFIRM anti-CD8 SP57). External controls were included in all the IHC slides. Outcome data were not available.
Manual interpretation of PD-L1 staining
Sections stained for PD-L1 IHC were scanned at ×200 magnification (Roche Ventana DP200 whole-slide scanner) and evaluated independently by two pathologists with different levels of experience (a general pathologist with 10 years of experience (AM) and a thoracic pathologist with 20 years of experience (EC)). The percentage of tumour cells with positive membranous±cytoplasmic staining of any intensity was assessed according to the VENTANA PD-L1 (SP263) interpretation guide for NSCLC (TPS). PD-L1 status was scored as a continuous variable and the result was also allocated to one of the following three categories: <1%, 1%–49% and ≥50%. Both initial readers recorded the turn-around time (TAT) in seconds between the pull-out of the H&E on the screen and the writing of the result on the score sheet. A third thoracic pathologist with 30 years of experience (FL-R) provided a consensus score.
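The allocation of a continuous TPS to the three reporting categories can be illustrated with a minimal Python sketch. The function name is hypothetical, and the handling of values between 49% and 50% (assigned here to the 1%–49% group) is an assumption, as the text does not specify it.

```python
def tps_category(tps_percent: float) -> str:
    """Map a continuous TPS (0-100) to one of the three reporting categories.

    Assumption: values from 1 up to (but not including) 50 fall in "1%-49%".
    """
    if not 0.0 <= tps_percent <= 100.0:
        raise ValueError("TPS must be a percentage between 0 and 100")
    if tps_percent < 1.0:
        return "<1%"
    if tps_percent < 50.0:
        return "1%-49%"
    return ">=50%"


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    for score in (0.0, 0.5, 1.0, 49.0, 50.0, 90.0):
        print(score, tps_category(score))
```

Keeping the continuous score alongside the category allows both the agreement on raw TPS values and the agreement on clinical categories to be analysed from the same record.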
Automated evaluation of PD-L1
Sections stained for PD-L1 IHC were scanned at ×200 magnification (Roche Ventana DP200 whole-slide scanner) and assessed independently using two commercially available AI algorithms that provided a TPS: (1) Navify Digital Pathology (NDP) PD-L1 SP263 image analysis algorithm for NSCLC (Roche Diagnostics) which required the drawing of a region of interest (ROI) and (2) the PathAI AIM-PD-L1-NSCLC algorithm (PathAI) which was a semi-automated tool (ie, the pathologist only had to exclude the external positive control). The NDP PD-L1 SP263 algorithm uses image analysis techniques to classify tumour cells based on their PD-L1 expression as either positive or negative. Rather than relying on an intrinsic detection threshold, cell positivity is determined by a supervised machine learning model. The PathAI AIM-PD-L1-NSCLC algorithm, as previously reported, employs convolutional neural networks.46 Both models were developed using ground truth labels provided by expert pathologists. Scoring TAT was also recorded. TAT was defined as the time in seconds between the start of the drawing tool use and the appearance of the result on the screen. For the NDP algorithm a single ROI was selected by two pathologists (AM and EC) encompassing only the tumour area. A third pathologist (FL-R) provided a consensus score. In the case of the PathAI algorithm, the TPS and TAT from only one observer were considered (EC) given the simplicity of the task.
Automated evaluation of TILs
Sections stained for CD8 IHC and H&E were scanned at ×200 magnification (Roche Ventana DP200 whole-slide scanner) and assessed using the NDP software and the PathAI AIM-TumorCellularity algorithm. For the NDP algorithm, one field of view from each of the three compartments was outlined with a ×20 objective from the CD8 IHC stained sections: intraepithelial, peritumoral stroma and intratumoral stroma, as described previously.47 Subsequently, all selected areas were scored automatically using an off-label nuclear algorithm on the NDP software.47 The density of CD8+ cells, calculated by dividing the number of positive cells by the size of the region (cells/mm²), was recorded for each of the three compartments.47 The total density of CD8+ cells in the entire tissue section was also recorded with the use of a single ROI. For the PathAI algorithm, the lymphocyte readout from the H&E was considered as total TILs and was afterwards used to calculate the density.48 Our primary goal was to investigate both approaches (IHC vs H&E) using two different vendors.
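The density calculation described above (positive cells divided by region size, in cells/mm²) can be sketched as follows. This is a minimal illustration, not the vendor implementation: the function name is hypothetical, and it assumes that the region area is exported in µm², which may differ from what the software actually reports.

```python
def cell_density_per_mm2(positive_cells: int, region_area_um2: float) -> float:
    """Return the density of positive cells in cells/mm^2.

    Assumption: the region area is given in square micrometres
    (1 mm^2 = 1,000,000 um^2).
    """
    if region_area_um2 <= 0:
        raise ValueError("region area must be positive")
    area_mm2 = region_area_um2 / 1_000_000.0
    return positive_cells / area_mm2


if __name__ == "__main__":
    # Hypothetical counts and areas for the three compartments.
    compartments = {
        "intraepithelial": (120, 400_000.0),
        "peritumoral stroma": (560, 400_000.0),
        "intratumoral stroma": (310, 400_000.0),
    }
    for name, (cells, area) in compartments.items():
        print(name, cell_density_per_mm2(cells, area))
```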
Statistical analysis
Agreement on PD-L1 scoring was evaluated considering PD-L1 both as a continuous variable (ie, PD-L1 TPS) and as a categorical variable. Interpathologist agreement and agreement between evaluation methods were both assessed. In the latter case, consensus scores were used. For PD-L1 TPS, the intraclass correlation coefficient (ICC) was employed as a metric of agreement between measurements, with interpretation as per Portney and Watkins (2000). Specifically, ICC (1,1) and ICC (3,1) from Shrout and Fleiss (1979) were used for interpathologist agreement and agreement between evaluation methods, respectively. Bootstrap CIs were obtained for the ICC through the percentile method.
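For readers who wish to reproduce this type of analysis, the following Python sketch computes ICC (1,1) and ICC (3,1) from the ANOVA mean squares defined by Shrout and Fleiss, together with a simple percentile bootstrap CI obtained by resampling cases. It is an illustration under stated assumptions rather than the code used in the study (which was conducted in R): the function names, the number of bootstrap replicates, the seed and the example scores are all hypothetical.

```python
import random
from statistics import mean


def _mean_squares(ratings):
    """Return (BMS, WMS, EMS, k) for a table of n cases x k raters/methods."""
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    bms = ss_rows / (n - 1)                                  # between cases
    wms = (ss_total - ss_rows) / (n * (k - 1))               # within cases
    ems = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))  # residual
    return bms, wms, ems, k


def icc_1_1(ratings):
    """ICC (1,1): one-way random effects, single measurement."""
    bms, wms, _, k = _mean_squares(ratings)
    return (bms - wms) / (bms + (k - 1) * wms)


def icc_3_1(ratings):
    """ICC (3,1): two-way mixed effects, consistency, single measurement."""
    bms, _, ems, k = _mean_squares(ratings)
    return (bms - ems) / (bms + (k - 1) * ems)


def bootstrap_percentile_ci(ratings, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling cases (rows) with replacement."""
    rng = random.Random(seed)
    n = len(ratings)
    estimates = []
    for _ in range(n_boot):
        sample = [ratings[rng.randrange(n)] for _ in range(n)]
        try:
            estimates.append(statistic(sample))
        except ZeroDivisionError:
            continue  # degenerate resample with no variance; skip it
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical TPS values (two readers per case), for illustration only.
    scores = [[0, 0], [1, 2], [5, 5], [10, 15], [40, 30], [60, 70], [80, 85], [95, 90]]
    print("ICC(1,1):", icc_1_1(scores))
    print("ICC(3,1):", icc_3_1(scores))
    print("95% CI for ICC(1,1):", bootstrap_percentile_ci(scores, icc_1_1))
```

Resampling whole cases, rather than individual scores, keeps the paired structure of the readings intact within each bootstrap replicate.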
For PD-L1 considered as a categorical variable, weighted Cohen’s kappa using linear weights was employed, with interpretation as per Landis and Koch (1977). Cochran’s Q test was used to assess whether the proportion of tumours identified as equal to or above 1% TPS differed between evaluation methods. Differences in PD-L1 TPS evaluation time by observer or evaluation method were assessed through Wilcoxon signed-rank test and Friedman’s test. Whenever pairwise comparisons were conducted, Holm’s method was employed to adjust for multiple testing.
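The categorical analyses named above can be sketched in plain Python. The block below shows a linearly weighted Cohen’s kappa, the Cochran’s Q statistic for paired binary outcomes and Holm’s step-down adjustment. It is a minimal illustration with hypothetical function names and is not the R code used in the study; the Q function returns only the statistic and its degrees of freedom, leaving the chi-square p value to a statistical package.

```python
def linear_weighted_kappa(rater_a, rater_b, categories):
    """Cohen's kappa with linear weights for ordered categories (needs >= 2)."""
    index = {c: i for i, c in enumerate(categories)}
    c, n = len(categories), len(rater_a)
    observed = [[0.0] * c for _ in range(c)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(c)]
    col = [sum(observed[i][j] for i in range(c)) for j in range(c)]
    p_obs = p_exp = 0.0
    for i in range(c):
        for j in range(c):
            weight = 1.0 - abs(i - j) / (c - 1)  # 1 on the diagonal, 0 at the extremes
            p_obs += weight * observed[i][j]
            p_exp += weight * row[i] * col[j]
    return (p_obs - p_exp) / (1.0 - p_exp)


def cochran_q(binary_rows):
    """Cochran's Q for n cases x k methods coded 0/1; returns (Q, df)."""
    k = len(binary_rows[0])
    col_totals = [sum(r[j] for r in binary_rows) for j in range(k)]
    row_totals = [sum(r) for r in binary_rows]
    total = sum(row_totals)
    numerator = k * (k - 1) * sum((cj - total / k) ** 2 for cj in col_totals)
    denominator = sum(ri * (k - ri) for ri in row_totals)
    return numerator / denominator, k - 1


def holm_adjust(p_values):
    """Holm step-down adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    cats = ["<1%", "1%-49%", ">=50%"]
    reader_1 = ["<1%", "<1%", "1%-49%", "1%-49%", ">=50%", ">=50%"]
    reader_2 = ["<1%", "1%-49%", "1%-49%", "1%-49%", ">=50%", "1%-49%"]
    print("weighted kappa:", linear_weighted_kappa(reader_1, reader_2, cats))
    # Columns: manual, NDP, PathAI (1 means TPS >= 1%).
    positives = [[1, 1, 1], [0, 1, 1], [0, 1, 1], [0, 0, 1], [0, 0, 0]]
    print("Cochran's Q, df:", cochran_q(positives))
    print("Holm-adjusted:", holm_adjust([0.01, 0.04, 0.03]))
```

Cases on which all methods agree contribute nothing to Q, so the test is driven entirely by the discordant cases.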
Association between CD8+ cells density in the different compartments was evaluated through Pearson’s correlation coefficient on log2-transformed data, whereas association between PD-L1 TPS and TILs density was assessed through Kendall’s correlation coefficient. Significance level was set to 0.05. All statistical analyses were conducted using R V.4.4.1.
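The two correlation measures can be illustrated with a short Python sketch. It makes several assumptions that the text does not confirm: the tau-b variant of Kendall’s coefficient is used to handle ties, densities are assumed to be strictly positive before the log2 transformation (zeros would need a handling rule that is not described here), and all function names and example values are hypothetical.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b, computed by comparing all pairs of observations."""
    n = len(x)
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: excluded from both terms
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denominator = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denominator


def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pearson_on_log2(x, y):
    """Pearson's r after log2 transformation; raises ValueError if a value <= 0."""
    return pearson_r([math.log2(v) for v in x], [math.log2(v) for v in y])


if __name__ == "__main__":
    # Hypothetical densities (cells/mm^2) and TPS values, for illustration only.
    peritumoral = [120.0, 450.0, 80.0, 900.0, 300.0, 60.0]
    intratumoral = [90.0, 380.0, 110.0, 700.0, 210.0, 40.0]
    tps = [0, 5, 0, 80, 20, 1]
    print("Pearson r (log2):", pearson_on_log2(peritumoral, intratumoral))
    print("Kendall tau-b:", kendall_tau_b(tps, intratumoral))
```

A rank-based coefficient is a natural choice for TPS, which is bounded, heavily tied at 0% and far from normally distributed.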
Results
PD-L1 expression: comparison of AI-powered and manual scoring
All PD-L1 IHC slides were evaluable for manual and AI-powered image analysis (table 1). Agreement for continuous PD-L1 TPSs between observers was excellent for both manual evaluation (ICC: 0.9832 (95% CI 0.9592, 0.9954)) (figure 1A) and when using the NDP PD-L1 algorithm (ICC: 0.9983 (95% CI 0.9967, 0.9993)) (figure 1B). As reflected in figure 1, interobserver agreement was higher in tumours with low PD-L1 expression regardless of the approach. However, in tumours with higher PD-L1 expression, greater differences were identified with manual evaluation than when using the NDP algorithm. There was no need to explore the degree of agreement between users of the PathAI algorithm as this was a fully automated task. As shown in figure 2, agreement between methods was higher in tumours with low PD-L1 expression (ie, differences were smaller closer to 0) regardless of the assay used (figure 3). However, in tumours with higher PD-L1 expression, greater differences were identified (manual vs NDP, ICC: 0.8912 (95% CI 0.7985, 0.9522); manual vs PathAI, ICC: 0.9791 (95% CI 0.9595, 0.9909); NDP vs PathAI, ICC: 0.8944 (95% CI 0.8066, 0.9576)) (figure 4). Next, we wanted to explore the agreement between observers and methods taking into account the categories that are currently used to make clinical decisions. It was reassuring that interobserver agreement was substantial to almost perfect for manual interpretation, and almost perfect for AI-assisted (NDP) interpretation (kappa=0.89 (95% CI 0.79, 1.00) and kappa=1 (95% CI 1, 1), respectively). Of note, both AI-powered tools identified a significantly higher proportion of cases equal to or above 1% PD-L1 TPS as compared with manual scoring (overall p value=0.00015; NDP vs manual, adjusted p value=0.00171; PathAI vs manual, adjusted p value=0.04124; PathAI vs NDP, adjusted p value=0.21552).
The PD-L1 TPS evaluation time, both by the manual method and by the NDP algorithm, was significantly longer in the observer with less experience (table 2, p value <0.0001). Taking into consideration the TAT of the more experienced observer, there were statistically significant differences between manual scoring and AI use, regardless of the algorithm (adjusted p values <0.0001 for all pairwise comparisons). Unsurprisingly, the need to delineate an ROI also influenced the TAT (figure 5).
TILs density by AI-powered scoring and correlation with automated PD-L1 expression
All slides (H&E and CD8 IHC) were evaluable for AI-powered image analysis. The total TILs density with the PathAI algorithm and the total density of CD8+ cells with the NDP software were significantly positively correlated (τ=0.49 (95% CI 0.37, 0.61), p value <0.0001). There was also correlation between the density of CD8+ cells in the different compartments: peritumoral stroma and intratumoral stroma (r=0.71 (95% CI 0.54, 0.82), p value <0.0001), intraepithelial and peritumoral stroma (r=0.65 (95% CI 0.46, 0.79), p value <0.0001) or intraepithelial and intratumoral stroma (r=0.44 (95% CI 0.18, 0.64), p value=0.0017). Despite the correlation between the automated NDP PD-L1 TPS and the total density of CD8+ cells (figure 6; τ=0.33 (95% CI 0.16, 0.49), p value=0.00087) or the density of CD8+ cells in either one of the three compartments (intraepithelial, intratumoral stroma and peritumoral stroma; τ=0.38 (95% CI 0.22, 0.54), p value=0.00011; τ=0.36 (95% CI 0.19, 0.54), p value=0.00019; and τ=0.31 (95% CI 0.14, 0.48), p value=0.0014, respectively), there was no correlation between the NDP PD-L1 TPS and the total TILs density (τ=0.17 (95% CI −0.01, 0.35), p value=0.082).
Discussion
Given that the potential for AI implementation is only realised in a minority of patients with NSCLC, extensive effort has been dedicated to issuing recommendations based on the expected barriers and opportunities. Accordingly, we have chosen to investigate three of the advantages of AI: it is more sensitive, it saves time and it makes diagnostics more reproducible.37 49 Although initially it was thought that trust in AI depended on explainability or accuracy, the latest data suggest that general belief in the usefulness of these tools might be more important for adoption.50 Therefore, our goal is to overcome the main challenge, which is cultural (ie, Is implementation difficult? Does AI help me score?), using the recommended pragmatic perspective for integrating AI in pathology.37 51 52 Our results highlight several attributes of AI that could ultimately assist pathologists in daily practice. Regarding TAT, there were statistically significant differences between manual scoring and AI usage, regardless of the algorithm. Interestingly, the agreement between methods was higher in tumours with low PD-L1 expression, and both AI-powered tools identified a significantly higher proportion of cases with PD-L1 TPS equal to or greater than 1% compared with manual scoring, which is a clinically relevant finding. Furthermore, this study supports the use of AI algorithms for TILs scoring.
The information presented here is very timely for several reasons. First, a recent survey covering NSCLC PD-L1 testing practice in the UK suggests that some centres may be under-reporting the PD-L1 TPS and that TAT was at least 5 days for more than a quarter of the respondents.53 Moreover, we completely agree with the authors of the UK survey that a ‘gestalt’ approach to scoring could miss small numbers of positive cells.53 As every pathologist knows, the interpretation of PD-L1 IHC on large tissue sections is always more challenging and time-consuming than scoring small biopsies.42 Such an eyeballing strategy is becoming increasingly popular (and potentially harmful) in early-stage NSCLC where large tissue sections must be screened for low cut-offs of positivity (Lopez-Rios, unpublished observation). It must be emphasised that manual PD-L1 quantification is very time-consuming and subject to significant interobserver variability, which can be influenced by the experience and the personality of the pathologist.5 54 55
Although it has previously been shown that published AI-powered PD-L1 IHC assessment algorithms offer many advantages, most of them lack generalisability and reproducibility. Despite the frequent use of open-source software, only isolated studies using best-of-class vended products are available and none of them compare several commercial options (reviewed in Ito et al56). Overall, AI reduces interobserver variation, helps resolve discordant cases, bridges the experience gap and identifies patients with low levels of PD-L1, suggesting better prediction of immunotherapy response.56–64 The limitations include the use of input material that deviates from clinical practice (ie, trial samples or tissue microarrays) or approaches that are not easy to implement for cost or access reasons.56 58 59 62–65
The situation outlined previously prompted us to investigate two commercially available AI alternatives that are easy to reconcile with clinical practice. AI-powered PD-L1 TPS scoring was faster than manual scoring, regardless of the experience of the pathologist. In one of the few studies that have addressed this matter, only untrained pathologists benefited from the speed of AI, and surprisingly the experienced observer’s readout times were longer.60 Therefore, two variables influence the TAT of AI algorithms: (1) the degree of engagement or experience of the pathologist-in-the-loop, sometimes in a counterintuitive manner; and (2) the subjectivity in the selection of the number and size/shape of the ROIs: for example, none versus one versus multiple.57 In fact, our excellent interobserver AI agreement was consistent with the use of either an individual ROI or a very reproducible exclusion rule (ie, the external positive control area), as both AI tools operated with very little user intervention. Along these lines, one of the algorithms reported here has recently been shown to improve interpathologist concordance when scoring NSCLC samples, which might help with the generalisability and reproducibility of this commercial AI option.57
Finally, AI-assisted PD-L1 TPS has been reported to be higher than pathologists’ assessment in surgical specimens,56 in keeping with the notion that AI can identify a higher number of PD-L1 positive patients, especially at lower cut-offs.59 60 64 Reflecting this tendency, both AI-powered tools in our study identified more patients than manual scoring at the 1% TPS cut-off. The clinical utility of this hypothesis has been demonstrated with samples from registry trials using one of the algorithms that we have investigated.58 61 In addition to the possibility of identifying more patients who are candidates for ICI, AI-powered PD-L1 scoring has the potential to transform the current categorical interpretation into a continuous predictive biomarker, as recommended in a recently released international guideline.4 38 Overall, the two PD-L1 TPS algorithms that we have tested performed similarly and further validation studies are warranted to speed up their implementation in clinical practice. This is a neglected topic in the literature, as there is more interest in the evaluation of one algorithm across several PD-L1 clones than in the design of head-to-head comparisons of TPS analysers.64
Second, although TILs scoring is becoming an established prognostic/predictive parameter across different tumour types,66–69 implementation in patients with NSCLC has remained elusive. This is very unfortunate as the number of reports in early-stage (refs47 70–78 and reviewed in Conde et al47 and Donnem et al77) or advanced NSCLC (refs18 19 26–28 30–32 36 79–83 and reviewed in Sholl21) continues to grow. The main reasons are the lack of standardisation of the assessment tools or protocols and the limited interest for replicating methodologies established by others. It is astonishing that there is literally no overlap in the design of the studies, with a high number of interconnected variables preventing harmonisation47 77: immunotherapy alone versus immunotherapy plus chemotherapy versus chemoradiotherapy, different ICI agents or treatment lines, tissue microarrays versus whole slides, NSCLC versus squamous cell carcinoma versus adenocarcinoma, single markers versus multiplexed technologies, manual scoring versus digital analysis, standard H&E staining versus IHC versus immunofluorescence versus RNA expression, CD8+ TILs versus other markers, density versus spatial distribution (ie, stromal vs intraepithelial compartment, periphery vs centre of the tumours), open-access algorithms versus prototypes versus out-of-the-box solutions.18 19 26–28 30–32 36 43 47 70–84
To partially overcome the shortcomings, we used commercially available AI-based TILs assessment tools (ie, ready-to-go) with very simple protocols that could be replicated elsewhere. We first wanted to consider the location of CD8+ TILs, because the investigation of the spatial distribution of TILs is feasible in resection specimens. Consistent with previous data, the use of a modified nuclear algorithm to quantitate CD8+ TILs was successful.47 85–87 As expected, there was a significant correlation between the CD8+ TILs results in all three compartments, which was higher for the comparison of the two stromal readings. In agreement with other authors, we believe that scoring the intraepithelial compartment is less reproducible,18 83 particularly for lung adenocarcinomas (ACs), due to the presence of blank spaces in most AC architectural patterns of growth. In contradistinction, the solid pattern of growth of lung squamous cell carcinomas usually allows for an easier delineation of the three compartments.47 Therefore, the over-representation of ACs in most series probably underlies the literature’s focus on stromal TILs.18 31 34 83 88 Accordingly, we next wanted to explore AI-based total TILs density assessment, reasoning that this strategy would be more realistic in small biopsies and could also overcome the alleged lack of reproducibility when localising TILs.18 20 47 77 Consistent with this notion, the total TILs density and the total density of CD8+ cells were significantly positively correlated (τ=0.49).
Although the main limitation of our study is that we do not have prognostic or predictive information from our small cohort, the correlation between PD-L1 and CD8+ TILs further supports the validity of our results. A review of published data in the light of our findings suggests that this association is a well-known fact, despite the lack of clinical implementation of TILs as a complementary predictive biomarker for ICIs.19 31 33 34 79 80 Another limitation is that the detection thresholds of the AI algorithms cannot be adjusted based on variations in IHC staining. Therefore, the interlaboratory reproducibility of our results needs further exploration.
In conclusion, our exploratory study addresses some of the pros and cons of introducing AI in pathology laboratories. The use of commercially available AI algorithms in NSCLC surgical specimens has the potential to improve PD-L1 IHC scoring and to standardise TILs assessment. Our early insights could provide a way forward for the integration of these predictive-prognostic biomarkers into pathology workflows for all patients with NSCLC.
Ethics statements
Patient consent for publication
Ethics approval
The institutional ethics committee 'Hospital Universitario 12 de Octubre' reviewed and approved this study (number: 22/03).
Acknowledgments
AM and FL-R are supported by the Tom Crean expedition. We thank Roche and PathAI for providing additional details on the capabilities of their algorithms.
References
Footnotes
AM and SH are joint first authors.
Handling editor Vikram Deshpande.
X @turandot85
AM and SH contributed equally.
Contributors AM, EC and FL-R scored the immunohistochemistry slides. AM, SH, MA, MP, DC, FL-R and EC contributed to data curation, analysis and manuscript writing. MP was in charge of the statistical design and analysis. MA and EC were the digital pathology and artificial intelligence coordinators. FL-R was responsible for the overall content. FL-R acts as guarantor.
Funding This work was supported by grants from Fundacion Mutua Madrileña (AP18051-2022), Instituto de Salud Carlos III (ISCIII) (PI17-01001, Fondos FEDER and Plan Estatal 2013-2016 and PI22-01700, co-funded by the European Union) and the Comunidad de Madrid iLUNG Programme (P2022/BMD-7437). Roche provided funding for the immunohistochemistry reagents.
Competing interests Regarding the scope of this work, SH, FL-R and EC have received funding and honoraria from Roche. MA has served as a speaker for Roche. The remaining authors declare no conflict of interest.
Provenance and peer review Not commissioned; externally peer reviewed.