Aims To evaluate if a deep learning algorithm can be trained to identify tumour-infiltrating lymphocytes (TILs) in tissue samples of testicular germ cell tumours and to assess whether the TIL counts correlate with relapse status of the patient.
Methods TILs were manually annotated in 259 tumour regions from 28 whole-slide images (WSIs) of H&E-stained tissue samples. A deep learning algorithm was trained on half of the regions and tested on the other half. The algorithm was further applied to larger areas of tumour WSIs from 89 patients and correlated with clinicopathological data.
Results A correlation coefficient of 0.89 was achieved when comparing the algorithm with the manual TIL count in the test set of images in which TILs were present (n=47). In the WSI regions from the 89 patient samples, the median TIL density was 1009/mm2. In seminomas, none of the relapsed patients belonged to the highest TIL density tertile (>2011/mm2). TIL quantifications performed visually by three pathologists on the same tumours were not significantly associated with outcome. The average interobserver agreement between the pathologists when assigning a patient into TIL tertiles was 0.32 (Kappa test) compared with 0.35 between the algorithm and the experts. A higher TIL density was associated with a lower clinical tumour stage, seminoma histology and lack of lymphovascular invasion.
Conclusions Deep learning–based image analysis can be used for detecting TILs in testicular germ cell cancer more objectively and it has potential for use as a prognostic marker for disease relapse.
- digital pathology
- tumour immunity
- image analysis
In this study, we trained a deep learning algorithm for automated identification and counting of tumour-infiltrating lymphocytes (TILs) in testicular germ cell tumours. Within the field of image-based medical diagnostics, deep learning has recently shown encouraging results in ophthalmology,1 dermatology,2 diagnostic pathology3 as well as cancer outcome prediction.4 Deep learning, an approach to machine learning based on artificial neural networks, has improved the accuracy of many pattern recognition tasks, such as speech recognition, visual object recognition and interpretation of scenes in digital images.
The concept that immune cells play a key role in controlling the growth and recurrence of tumours has been postulated and gained more widespread recognition with the ‘Immunoscore’ in colorectal cancer which demonstrated a prognostic value superior to the TNM classification.5 Seminomas typically comprise an inflammatory infiltrate consisting of CD4 and CD8 T cells with T cells outweighing the numbers of B cells.6 In the majority of studies, the TIL count has been assessed by a pathologist rather than by image analysis approaches.7 However, this approach is not objective and has poor reproducibility. Moreover, such subjective analysis is limited by the lack of standardised cut-offs and standard procedures to assess the level of TILs, for example, the use of proportion of cells versus the absolute number of cells, evaluation of the whole tumour slide compared with stroma only and scoring the whole slide compared with only the invasive front.8
Digitisation of tissue samples enables new technological advances to be exploited, including the application of machine learning approaches to perform image analysis.
TILs are commonly quantified through microscopy of tissue sections stained for basic morphology with H&E and can be expressed as an absolute count or as a percentage of area.9 Challenges in the visual evaluation relate to subjectivity; evaluations suffer from intraobserver and interobserver variability, and manual assessment of whole-slide samples is time consuming and semiquantitative.10
Most testicular tumours are germ cell tumours, which are divided for clinical management purposes into seminomas and non-seminomas. Since the introduction of platinum-based chemotherapy regimens, outcomes in testicular cancer have been among the most favourable. However, a small number of patients continue to die from disseminated disease. At the opposite end of the spectrum is the problem of adverse effects of adjuvant treatment in patients who would have been cured by orchiectomy alone. Prognostic factors for disease progression are virtually absent, with lymphovascular invasion in non-seminomas and tumour size in seminomas being the exceptions.
This proof-of-concept study describes the implementation of a deep learning–based algorithm for testicular germ cell tumours that can be used for quantification of TILs in tissue samples stained for basic morphology only. It also reports the evaluation of a computerised TIL assessment to predict relapse of the disease and the correlation of TIL count with other clinical and pathological variables.
Materials and methods
A total of 113 patients with primary testicular cancer (69 seminomas and 44 non-seminomas) were retrospectively selected for the study. The non-seminomas were defined as per clinical management pathways and included any pure non-seminoma subtype or mixed testicular germ cell tumours. These were an unselected series of patients from 2005 to 2016 for which slides were available for scanning. Between 1 and 26 H&E slides per patient (average five slides) were scanned in anonymous form and clinical data were available in a Caldicott Guardian–approved testicular tumour database at Oxford University Hospitals NHS Foundation Trust, UK.11 12 The following clinical and pathological data were recorded: size of tumour, laterality, clinical stage at presentation, rete testis stroma invasion, lymphovascular invasion, pathological tumour stage, histological tumour type, disease relapse, time to relapse, adjuvant chemotherapy and overall length of follow-up. All samples are retrospective archival material and the results did not have any impact on the management of the patients.
Digitisation of samples
Before digitisation, all personal identification information and unique identifiers were replaced with project-specific codes. The original diagnostic slides (n=113) from the pathology archives were scanned for the study. These were derived from sections (4–5 µm thick) cut from formalin-fixed, paraffin-embedded whole-slide blocks, stained with H&E and digitised with a whole-slide scanner (Hamamatsu 2.0 Nanozoomer) using a ×20 objective. Images were stored in a proprietary format (NDPI) with a pixel in the image corresponding to a 0.227×0.227 µm area in the sample. Images were further compressed to a wavelet file format (Enhanced Compressed Wavelet, ECW, ER Mapper; Intergraph, Atlanta, Georgia, USA) with a target compression ratio of 1:10. The compressed virtual slides were uploaded to a whole-slide image (WSI) management server (WebMicroscope; Fimmic Oy, Helsinki, Finland).
Training and testing with deep learning algorithm
The most representative slide for each case was selected by an expert testicular tumour pathologist (CV). In seminomas, the features (including TILs) were usually consistent across the slides; where they were not, a slide that showed as little necrosis as possible and appeared representative of the degree of TIL infiltration was selected. For non-seminoma, a slide that represented as many of the tumour subtype elements as possible was selected. A region of interest was selected from each slide, which was a rectangle comprising a representative area of the tumour. The rectangle was created on a case-by-case basis by assessment of individual slide images and was made as large as possible within the tumour without including significant areas of necrosis or background tissue. Avoidance of necrotic areas and of tissue outside the tumour border is part of the guidance for assessment of TILs in breast tumours.9
For the digital image analysis of the samples, a commercially available image analysis software platform was used (WebMicroscope) that uses deep learning–based machine-learning algorithms to create computer vision applications. The model for the identification of the TILs consists of two sequential algorithms; the first algorithm analyses the whole image for detection of tissue regions that contain TILs and then passes the results on to the second algorithm that counts the number of TILs found in each region.
The training and test sets used for the creation of the algorithms consisted of whole-slide images from 28 of the 89 patients (19 seminomas and 9 non-seminomas), from which 259 small regions (150×150 pixels, corresponding to an area of 34×34 µm in the sample) were randomly selected (figure 1). These comprised 101 images with TILs and 158 that did not contain TILs. From the 259 images, 129 were used for training and 130 for testing. Annotation of TILs was done by one of the researchers (NL) and verified by a pathologist (CV), and a total of 812 TILs were annotated by drawing pixel-level polygons encompassing the TILs, with a median of 7 TILs per 150×150 pixel image. The rationale for annotating 150×150 pixel images was to keep the manual drawing task feasible since a pixel-level mask was drawn for each TIL. Based on the labelled areas, the software fits a deep learning model for locating candidate regions of interest, in this example the TIL regions, and generates a region segmentation mask. The candidate regions are then used as input in training of a second classifier that performs object detection and localisation and counts the number of detected TILs.
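The correspondence between pixel and physical dimensions follows from the reported pixel size of 0.227 µm. A minimal sketch of that arithmetic is given here; the function name `pixels_to_micrometres` is illustrative and is not part of the software used in the study.

```python
# Pixel side length reported for the scanned images (µm per pixel).
UM_PER_PIXEL = 0.227


def pixels_to_micrometres(n_pixels: float, um_per_pixel: float = UM_PER_PIXEL) -> float:
    """Convert a length in pixels to micrometres."""
    return n_pixels * um_per_pixel


# Side length of one 150x150 pixel annotation region.
side_um = pixels_to_micrometres(150)
print(f"{side_um:.2f} µm")  # about 34 µm, in line with the 34x34 µm regions
```

The same conversion applies to any other pixel dimension quoted for these images.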
The three colour channels in the images (red, green and blue) were converted into greyscale for training and analysis, and the feature size for detection of TIL candidate regions was set to a diameter of 30 pixels, corresponding to an area with a diameter of 6.81 µm in the sample. For training the second algorithm, aimed at counting TILs, a feature size of 40 pixels was used, corresponding to an area with a diameter of 9.08 µm in the sample. Image augmentation was used to create synthetic variations of the images in the training data. The augmentation methods included the following perturbations: rotation (random 0–360 degrees), scaling (5%), aspect ratio (5%), shear (5%), flipping (both horizontal and vertical), contrast (5%) and luminance (5%). A total of 50 complete passes (epochs) through the data set were performed during optimisation of the algorithm, and the methods of perturbation were the same for the two sequential algorithms. The deep learning software estimates the probability of a region (first algorithm) and object (second algorithm) in the image belonging to the TIL class as a percentage (0%–100%). Using probability thresholds of 0.4 and 0.5 for the first and second classifiers, respectively, the number of TILs was calculated both in the smaller train/test image regions (150×150 pixels) and in bigger areas, corresponding to 1.16–474.5 mm2 (average size 70 mm2) in the sample, selected from the tumour WSIs (n=113). In the larger areas, the TIL density was calculated as the number of TILs per square millimetre of the analysed tissue area. A quality check regarding both image and algorithm detections was performed manually (NL and JL) on the areas selected from the WSI, and a total of 24 patients were excluded (14 non-seminomas (31.8%) and 10 seminomas (14.5%)) due to the presence of out-of-focus areas or failure of the algorithm.
The higher failure rate with non-seminomas is likely to be a reflection of the vast morphological heterogeneity that can be seen in these tumours (embryonal carcinoma, yolk sac tumour, teratoma, choriocarcinoma and mixed). A final set of 89 patients was used for testing the algorithm on WSIs (figure 1).
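The density calculation described above can be sketched as follows. This is an illustration only, not the WebMicroscope implementation: the function names, the representation of detections as a plain list of probabilities, the treatment of the whole rectangle as analysed tissue and the use of "at or above" for the threshold are all assumptions. The 0.227 µm pixel size and the 0.5 threshold of the counting classifier are taken from the text.

```python
UM_PER_PIXEL = 0.227     # pixel side length reported for the scans
COUNT_THRESHOLD = 0.5    # probability threshold of the second (counting) classifier


def area_mm2(width_px: int, height_px: int, um_per_pixel: float = UM_PER_PIXEL) -> float:
    """Physical area of a rectangular image region in square millimetres."""
    width_mm = width_px * um_per_pixel / 1000.0
    height_mm = height_px * um_per_pixel / 1000.0
    return width_mm * height_mm


def til_density(probabilities, width_px, height_px, threshold=COUNT_THRESHOLD):
    """Count detections at or above the threshold and divide by the region area."""
    n_tils = sum(1 for p in probabilities if p >= threshold)
    return n_tils / area_mm2(width_px, height_px)


# Hypothetical detection probabilities in a 10 000 x 10 000 pixel region.
example_probabilities = [0.91, 0.42, 0.77, 0.50, 0.13]
print(til_density(example_probabilities, 10_000, 10_000))
```

In practice the denominator would be the tissue area actually analysed, which can be smaller than the bounding rectangle.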
Assessment of the algorithm performance
The TIL estimation algorithm was assessed both using pixel-level ground truth annotations in the smaller regions of interest (150×150 pixels) and in the bigger regions selected from the tumour area in the whole-slide images (image size range 24.5 megapixels to 9.2 gigapixels).
Visual assessment of TILs by pathologists
Three pathologists assessed the percentage of TILs in the same tumour regions as described above, selected from the WSIs (n=89), blinded to the clinicopathological data and patient outcome. The assessment was performed using a web-based scoring tool on a digital microscopy platform (WebMicroscope). Since it was not feasible for the pathologists to count the TILs in the extracted areas, they instead estimated the percentage as a continuous parameter (0%–100%), and this score was then divided at tertiles to allow comparison with the algorithm score. The percentage TILs in the extracted area was assessed as an average, not focusing on hotspots, and all mononuclear cells were included, but polymorphonuclear leucocytes were excluded.
The detection accuracy of the classifier on a pixel level was evaluated with the area under the receiver operating characteristic (ROC) curve (AUCROC). The ROC curve is defined with false-positive rate (1−specificity) on the x-axis and true positive rate (sensitivity or recall) on the y-axis. The positive predictive value (precision) is defined as the ratio between true positives and the sum of true positives and false positives. The F-score is defined as a harmonic mean of precision and recall: 2×(precision×recall)/(precision+recall) and used in selection of the threshold for TIL classification. The agreement between the pathologist and automated method in the assessment of TILs was estimated by percent agreement and kappa statistics. Kappa values were categorised as suggested previously in the literature: <0 as disagreement, 0–0.20 as slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial and 0.81–1 as almost perfect agreement.13 The association between the number of manually annotated cells and the algorithm count was analysed using Spearman’s rank-order correlation and a correlation coefficient presented (Spearman’s rho). The association between the number of TILs/mm2 of tumour tissue and other pathological factors and patient outcome was analysed using the χ2 test or Fisher’s exact test in case of low contingency table cell entries.
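The definitions above can be written out in a few lines. The following is a minimal pure-Python sketch under our own naming, not the statistical software used in the study; it computes precision, recall and F-score from counts, and Spearman's rho as the Pearson correlation of average ranks. All function names are illustrative.

```python
def precision_recall_f(tp: int, fp: int, fn: int):
    """Precision, recall and F-score from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical manual and algorithm counts for five regions (not study data).
manual = [1, 4, 7, 12, 26]
algorithm = [2, 3, 7, 14, 29]
print(spearman_rho(manual, algorithm))
```

The sketch does not handle constant inputs, for which the correlation is undefined.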
Classification accuracy on small cell-level annotated areas
The median number of annotated TILs in the TIL positive (n=47) small region test set (150×150 pixels) was 7 (range 1–26) and the median number of TILs detected by the algorithm was 7 (range 1–29). Overall F-score for the TIL classifier on a cell level in the small region test set was 0.88 while the corresponding sensitivity (recall) was 89% and positive predictive value (precision) 87%. The correlation coefficient between the number of manually annotated TILs and the algorithm counted TILs on a cell level in the small regions of interest (150×150 pixels) was 0.89 (Spearman’s rho) (figure 2).
Quantification and visualisation of TILs on tumour whole slides
The TIL detection algorithm was applied to bigger tumour areas selected from WSIs (n=89). Example areas of the detection results are presented in figures 3 and 4. The average number of detected TILs/tumour was 136 034 (range 34 to 1 618 878) and the median TIL density was 1009 TILs/mm2 (range 60–7092). The lower and upper tertiles were 625 and 2011 TILs/mm2, respectively. Three pathologists visually scored the same areas for percentage TILs in the tumour. The agreement between the algorithm and the three experts when dividing the patients into three equally sized groups (cutpoint at tertiles of either TIL density or percentage) according to TIL quantities is reported in table 1. The average Kappa value between the algorithm and the pathologists (Kappa=0.35) was on the same level as between the experts (Kappa=0.33).
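The tertile grouping and agreement statistic can be illustrated with a short sketch. The cut-points 625 and 2011 TILs/mm2 are those reported above; the boundary handling, the use of unweighted Cohen's kappa and all function names are assumptions made for this example, since the text does not specify them.

```python
from collections import Counter

LOWER_CUT = 625    # lower tertile cut-point reported in the text (TILs/mm2)
UPPER_CUT = 2011   # upper tertile cut-point reported in the text (TILs/mm2)


def tertile_group(density: float) -> str:
    """Assign a TIL density to a low, middle or high group.

    Treating values equal to a cut-point as the lower group is an assumption.
    """
    if density <= LOWER_CUT:
        return "low"
    if density <= UPPER_CUT:
        return "middle"
    return "high"


def cohen_kappa(ratings_a, ratings_b) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same cases."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical densities (not study data) grouped with the reported cut-points.
print([tertile_group(d) for d in (60, 1009, 2500)])  # ['low', 'middle', 'high']
```

Calling `cohen_kappa` on the algorithm's groups and one pathologist's groups for the same patients gives a pairwise value of the kind averaged in table 1.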
TIL counts in relapsed versus non-relapsed patients assessed by automated method
Patients who survived without relapse had a median algorithm-based TIL density of 1104 TILs/mm2 in their primary tumour compared with 807 TILs/mm2 in relapsed patients (p=0.06) (figure 5). Of the 89 patients with testicular germ cell tumour (59 seminomas and 30 non-seminomas), 7 with seminoma and 8 with non-seminoma relapsed during follow-up. Among patients with seminoma, significantly fewer of those in the highest algorithm-based TIL density group had a relapse (p=0.036) when a cut-off at the upper tertile (2011 TILs/mm2) was applied (table 2). No corresponding statistically significant association between the TIL estimates and relapse status was seen in patients with seminoma when TILs were quantified by a pathologist (table 2). In non-seminoma, there was no statistically significant association between either algorithm-based or expert-based TIL quantity and relapse status (table 2).
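For small contingency tables such as relapse status against TIL group, Fisher's exact test can be computed directly from the hypergeometric distribution. The sketch below is a generic two-sided implementation under our own naming, not the software used in the study, and the example table is illustrative rather than study data.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Illustrative table only: rows are TIL groups, columns are relapse yes/no.
print(fisher_exact_two_sided(3, 1, 1, 3))
```

The definition of "two-sided" used here (summing tables no more probable than the observed one) is one common convention; other conventions can give slightly different p values.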
Association of TILs with pathological variables
In the entire set of patients (n=89), a higher number of TILs/mm2 tumour was significantly associated with seminoma tumour type (p=0.003), lower clinical stage at presentation (p=0.02) and lack of lymphovascular invasion (p=0.01), but not with any of the other pathological characteristics analysed (table 1). The higher % TILs in seminoma versus non-seminoma is what would be expected based on pathology observations (seminomas typically being more heavily inflamed than non-seminomas) and helps validate the results. In the set of seminomas (n=59), a higher number of TILs/mm2 tumour was significantly associated with a lower clinical stage at presentation (p=0.003) and lack of rete testis invasion (p=0.03), but not with any of the other clinicopathological characteristics analysed (figure 6). Size of tumour in seminoma (larger tumour size being an adverse prognostic factor) was not of statistical significance, although tumour size >40 mm approached significance (p=0.06). Also, in seminoma, rete testis stroma invasion and lymphovascular invasion were associated with lower TIL counts (p=0.03 and 0.01, respectively). In the patients with non-seminoma (n=30), no statistically significant association between the algorithm-based TIL density in the tumour and the analysed pathological characteristics was found (table 3).
In the present study, we evaluated a deep learning–based approach to identify and count TILs in a series of digitised H&E-stained whole slides of testicular germ cell tumours. Also, our aim was to evaluate whether the TIL count could predict disease relapse in patients with testicular cancer. Currently, TIL scoring of H&E-stained slides is based on visual interpretation by an expert and thereby prone to substantial interobserver and intraobserver variations.14 15
Deep learning is an example of the machine-learning theory of feature learning where the method can iteratively improve from learnt exemplifications of the underlying data with the goal of achieving the maximum class separability.16 In a recent study using TCGA samples across 13 different tumour types, but not testicular cancer, it was shown that deep learning–based TIL algorithms can identify TIL patterns that are linked to molecular features and outcome.17 In this study, an average area of 70 mm2 of the tumour was analysed with the algorithm and on average more than 100 000 cells per sample were counted, which is several orders of magnitude more than would be feasible to count through visual assessment.
Non-seminoma shows a more variable distribution of inflammatory cell infiltrates with approximately equal numbers of B and T cells,18 and seminoma is typically more inflamed than non-seminoma, with a higher number of T cells.19 Testicular germ cell tumours with high infiltration of PD-L1-positive TILs have been shown to have significantly better progression-free survival and overall survival.20 In a series of 150 men with testis-confined (stage I) seminomas managed by surveillance, in which the TIL count was categorised by a pathologist as high, intermediate or low, a lower TIL count was of borderline statistical significance for relapse.21
Although the numbers were relatively few (because of disease rarity), a statistically significant association between low TILs and relapse was found. Testicular germ cell tumour studies are often designed with clinical stage at presentation as a surrogate for relapse because relapse data are not available. It is a strength of this study that we had both parameters and that both showed a statistically significant association with low TILs in seminomas, which reinforces the accuracy of the algorithm. The algorithm ran more effectively on seminomas, where the tumour has a homogeneous appearance, than on non-seminomas, where tumours are often heterogeneous with multiple different subtypes. A portion of the non-seminoma slides failed regarding TIL detections, mainly due to dense nuclei that resemble TILs; iteration of the algorithm, especially for non-seminomas, could therefore further improve the results. The high variability in the background morphology of non-seminoma cases, however, could mean that hundreds of additional samples would need to be collected, which was not feasible in the context of the current study.
The significant results regarding disease relapse were all within the seminoma cohort and not the non-seminoma cohort. There are only a few prognostic factors in seminomas, size of tumour being the only relatively consistent one across the literature. This TIL assessment tool shows promise for better stratification of patients with this tumour type and should be explored further in a larger cohort. Also, the slides in the current study were from two laboratories only and were scanned on a single scanner. Future studies should include tissue slides scanned with various scanners and from multiple laboratories to allow deep learning algorithms to learn a more extensive range of variation in patterns.
In this study, we have adopted deep learning–based automated methods for identifying TILs in testicular germ cell cancer and shown that a high accuracy in the automated detection of manually annotated TILs can be achieved. We show that, in this relatively small cohort, the TIL density in seminomas assessed by the proposed method has prognostic relevance regarding disease relapse which could not be achieved by pathologist assessment.
Take home messages
- Deep learning–based classifiers are excellent candidates for automated quantification of TILs in high-throughput settings.
- The tumour-infiltrating lymphocyte (TIL) count in seminomas has prognostic relevance regarding disease relapse.
- A higher TIL density was associated with a lower clinical tumour stage, seminoma histology and lack of lymphovascular invasion at presentation.
- Deep learning–based image analysis can be used for detecting TILs in testicular germ cell cancer more objectively and it has potential for use as a prognostic marker for disease relapse.
We thank Hakan Küçükel for development of the image processing pipeline.
Handling editor Dhirendra Govender.
Contributors CV, JL and NL conceived the concept of this study and supervised the study. JL and NL designed computational experiments. CV, JL and NL implemented the experiments. JL performed statistical analyses of the results. CV, NL and JL drafted the manuscript. CV collected the patient sample data and collected the clinical information. CV, RC, and RP performed the visual scorings. ML performed management and pre-processing of digital slides. JCT helped conceive the study and JJ collected the clinical data. All authors approved the manuscript.
Funding The research has received funding from the Sigrid Jusélius Foundation (JL), Medicinska Understödsföreningen Liv och Hälsa (JL), Stiftelsen Dorothea Olivia, Karl Walter och Jarl Walter Perkléns minne (NL) and Finska Läkaresällskapet (NL, JL). CV's research time is part funded by the Oxford NIHR Biomedical Research Centre (Molecular Diagnostics Theme/Multimodal Pathology Subtheme). The research was funded by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC).
Disclaimer The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.
Competing interests JL and ML are founders and consultants at Fimmic Oy, Helsinki, Finland.
Patient consent Not required.
Ethics approval This manuscript reports a retrospective study of routinely collected samples conducted under London-Westminster REC approval 14/LO/2074.
Provenance and peer review Not commissioned; externally peer reviewed.