Aims Virtual microscopy utilising digital whole slide imaging (WSI) is increasingly used in breast pathology. Histologic grade is one of the strongest prognostic factors in breast cancer (BC). This study aims at investigating the agreement between BC grading using traditional light microscopy (LM) and digital WSI with consideration of reproducibility and impact on outcome prediction.
Methods A large (n=1675) well-characterised cohort of BC originally graded by LM was re-graded using WSI. Two separate virtual-based grading sessions (V1 and V2) were performed with a 3-month washout period. Outcome was assessed using BC-specific and distant metastasis-free survival.
Results The concordance between LM grading and WSI was strong (LM/WSI Cramer’s V: V1=0.576, and V2=0.579). The agreement regarding grade components was as follows: tubule formation=0.538, pleomorphism=0.422 and mitosis=0.514. Greatest discordance was observed between adjacent grades, whereas high/low grade discordance was uncommon (1.5%). The intraobserver agreement for the two WSI sessions was substantial for grade (V1/V2 Cramer’s V=0.676; kappa=0.648) and grade components (Cramer’s V: tubule formation (T)=0.628, pleomorphism (P)=0.573 and mitosis (M)=0.580). Grading using both platforms showed strong association with outcome (all p values <0.001). Although mitotic scores assessed using both platforms were strongly associated with outcome, WSI tended to underestimate mitotic counts.
Conclusions Virtual microscopy is a reliable and reproducible method for assessing BC histologic grade. Regardless of the observer or assessment platform, histologic grade is a significant predictor of outcome. Continuing advances in imaging technology could potentially provide improved performance of WSI BC grading and in particular mitotic count assessment.
- breast pathology
- virtual microscopy
Introduction
Virtual microscopy (VM) using digital whole slide imaging (WSI) is a technology through which glass slides of pathological specimens are digitally scanned at high resolution for viewing on a computer screen. Applications of WSI in the clinical, educational and research settings, including image analysis applications, are increasing and in some centres WSI has replaced conventional microscopy as a diagnostic tool used by pathologists.1–6 However, one of the main concerns related to VM adoption in breast pathology, in addition to diagnosis, is the assessment of prognostic and predictive variables including histologic grade.6 There is a perception that the quality of the images displayed by WSI may interfere with reliable histologic grading. In addition, the interpretive ability of the reporting pathologist assigning a ‘virtual grade’ to each cancer remains largely unknown.
Therefore, to improve WSI performance, enhancement of the WSI platform and training of histopathologists in the digital environment are recommended. However, testing the performance and reproducibility of WSI in case reporting is critically needed. This could be achieved via head-to-head comparison of WSI with traditional light microscopy (LM) to provide sufficient evidence prior to clinical adoption.
Grading of breast cancer (BC) using the Nottingham combined histologic grade is one of the strongest prognostic factors in early-stage disease.7–9 Grade is one of the main components of several management decision tools10–13 and it has recently been included in the American Joint Committee on Cancer tumour, node, metastasis staging system as a stage modifier.14 15 However, concordance of BC grading among pathologists using glass slides is only moderate, with reported kappa values of 0.48 to 0.53.16 17 The highest concordance rates are observed in grade 3 (kappa 0.60) and grade 1 (kappa 0.51) tumours, whereas the lowest is observed in grade 2 (kappa 0.33) tumours.16 The impact of the introduction of WSI in routine practice on the concordance of grade and its performance as a prognostic factor remains to be defined.
Therefore, this study aims at comparing the histologic grading of BC as assigned by an expert pathologist using WSI with the grade assessed in routine practice using LM. In addition to assessment of concordance, impact of different grading platforms on patient outcome was evaluated using the large well-characterised Nottingham BC cohort.
Patients and methods
This study was performed on a large series (n=1675) of patients with early-stage invasive primary operable BC who presented to Nottingham City Hospital from 1999 to 2006. This is a well-characterised cohort of BC with long-term clinical follow-up (median 135 months) and detailed clinicopathological profiles. Data collected included primary tumour histologic grade and grade components, tumour size and histotype, lymph node stage, nodal status, lymphovascular invasion, Nottingham Prognostic Index, molecular subtypes and outcome. Outcome data included breast cancer-specific survival (BCSS), defined as the time (in months) from the date of the primary surgical treatment to the time of death from BC, and distant metastasis-free survival (DMFS), defined as the time (in months) from surgery until the first event of distant metastasis. Patient and tumour demographics are summarised in table 1.
This tumour cohort was originally graded using the Nottingham grading system during routine pathology reporting utilising all available tumour glass slides (average four slides per case) and LM.8 For the purpose of this study, data for the final grade and the individual grade components (tubule formation, nuclear pleomorphism and mitotic count scores) were retrieved from the patients’ records. One to three tumour blocks per case were retrieved, and freshly prepared H&E slides were reviewed. A representative slide per case was selected by a specialised breast pathologist (EAR) without further glass slide grading. Glass slides were scanned into high-resolution (0.19 µm/pixel) digital images at ×20 magnification using a 3DHistech Pannoramic 250 Flash II scanner (3DHISTECH, Budapest, Hungary). The digital WSIs were generated, stored and viewed using the 3DHistech Pannoramic Viewer (3DHISTECH; http://www.3dhistech.com/downloads) on a high-resolution screen. The digital slide was graded using the College of American Pathologists’ criteria,18 which are essentially the same as the original Nottingham criteria.7 Digital images were initially examined at low magnification, where tubule formation was assessed. Low-to-intermediate magnification was also used to identify potential ‘hotspots’ for mitotic counting. For mitotic counting, the distance measure tool of the software was used; this was important for determining the number of mitotic figures in a given area.
To allow for assessment of intraobserver agreement of BC grading using WSI, the whole cohort was graded a second time by the same observer (LWD, an experienced breast pathologist with a special interest in BC grading) using the same criteria after a 3-month washout period, with no special training during that time. In both WSI grading sessions (V1 and V2), grade components were assigned blinded to the LM grade as well as other clinicopathological parameters.
Statistical analysis was performed using functions obtained from the open-source R statistical platform.19 Since WSI and LM have a procedural difference, Cramer’s V statistic was adopted to help judge the strength of concordance.20 The coefficient ranges from 0 (no association) to 1 (perfect association). The kappa statistic is technically a measure of concordance between two observers who are examining the same parameter following the same approach.21 For these analyses, the R library vcd provided the function required for calculating the Cramer’s V statistic (function assocstats), while the function for the kappa statistic was obtained from an R library for inter-rater reliability. Survival analysis was performed using SPSS V.23 for Windows using the log-rank test and Kaplan-Meier plots. Survival analysis (BCSS and DMFS) was performed on WSI grade as well as the WSI component scores. Likewise, survival analysis was performed for the glass-slide LM grade, including separate analysis of the component scores. Multivariate analysis was performed using Cox proportional hazards analysis with inclusion of parameters significantly associated with outcome in univariate analysis. Statistical significance in survival stratification was calculated by the log-rank method and univariate Cox regression analysis. A p value of <0.05 (two tailed) was considered significant.
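The two agreement statistics described above can be sketched in a few lines. The following Python code is an illustrative sketch only, not the R code used in the study: the function names and the 3×3 table of counts (rows as LM grade, columns as WSI grade) are hypothetical, and Cramer's V is computed here from the uncorrected Pearson chi-square statistic.

```python
import math

def _margins(table):
    """Return (grand total, row totals, column totals) of a contingency table."""
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    return sum(rows), rows, cols

def cramers_v(table):
    """Cramer's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    n, rows, cols = _margins(table)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(len(rows), len(cols)) - 1)))

def cohens_kappa(table):
    """Unweighted Cohen's kappa for a square agreement table."""
    n, rows, cols = _margins(table)
    p_observed = sum(table[i][i] for i in range(len(table))) / n
    p_chance = sum(r * c for r, c in zip(rows, cols)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical counts, NOT study data: rows = LM grade 1-3, columns = WSI grade 1-3.
example = [[40, 15, 1],
           [12, 60, 14],
           [2, 16, 40]]
print(round(cramers_v(example), 3), round(cohens_kappa(example), 3))
```

Kappa corrects the observed agreement on the diagonal for chance agreement, whereas Cramer's V measures the strength of association across the whole table, which is why the two statistics can differ for the same data.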
Results
In this study, a large (n=1675) retrospective cohort of early invasive primary operable BC was graded on high-resolution digital images acquired through WSI of representative slides. For this WSI grading, as for the original LM, the three-tier histologic grade of BC was used.7 8 18 Tables 2 and 3 show cross comparison of WSI grade with LM grade as well as the cross comparison of the three components of the Nottingham grade. Table 4 shows the cross comparison of the sum of grade components (3–9 scale) for LM grade scores and WSI grade scores.
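As a reminder of how the 3–9 component sum relates to the three-tier grade, the mapping can be sketched as follows. The cut-offs used (3–5 = grade 1, 6–7 = grade 2, 8–9 = grade 3) are the standard Nottingham cut-offs and are not restated in the text above; the function name is hypothetical.

```python
def nottingham_grade(tubules, pleomorphism, mitoses):
    """Return (component sum, final grade) from three component scores of 1-3.

    Assumes the standard Nottingham cut-offs:
    sum 3-5 -> grade 1, sum 6-7 -> grade 2, sum 8-9 -> grade 3.
    """
    for name, score in (("tubules", tubules),
                        ("pleomorphism", pleomorphism),
                        ("mitoses", mitoses)):
        if score not in (1, 2, 3):
            raise ValueError(f"{name} score must be 1, 2 or 3, got {score!r}")
    total = tubules + pleomorphism + mitoses
    if total <= 5:
        grade = 1
    elif total <= 7:
        grade = 2
    else:
        grade = 3
    return total, grade
```

Because the final grade is a thresholded sum, a one-point difference in a single component (for example, a lower mitotic score) changes the final grade only when the sum sits next to a cut-off.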
The agreement between WSI grading and glass slide/LM grading was moderate for both WSI grading sessions when the kappa statistic was used (V1/LM kappa=0.51 and V2/LM kappa=0.50). However, when Cramer’s V statistic was used, the value for WSI with LM was 0.58 in both sessions, which is considered substantial concordance. If grade is reduced to a binary level of high (ie, grade 3) versus not high (ie, grades 1 and 2), the Cramer’s V was 0.66. The unweighted kappa statistic for WSI grade with LM grade was 0.51. The kappa statistics for component scores were as follows: mitoses=0.47; tubules=0.49; and pleomorphism=0.3.
Importantly, exact grade agreement between WSI and LM grading was reached in 68% of cancers. There was 32.3% discordance between WSI and LM grade, which was largely between adjacent levels of grade: low versus intermediate, or intermediate versus high grade. The number of cancers with low–intermediate discordance (255 cancers; 15.1%) was almost the same as the number with intermediate–high discordance (265 cancers; 15.7%). The binary high versus low/intermediate discordance of grade was 17%. In this study, only 26 (1.5%) grade assignments were attributable to high versus low-grade discrepancy and the proportion reduction of high in LM to low in WSI was highly significant (p<0.00001).
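The breakdown of discordance reported above can be derived from a 3×3 cross-tabulation of the two platforms. The sketch below is a hypothetical illustration: the function name and the counts are invented for the example and are not the study data.

```python
def grade_discordance(table):
    """Summarise agreement in a 3x3 table (rows = LM grade 1-3, columns = WSI grade 1-3).

    Returns proportions of the total for exact agreement, each type of
    adjacent-grade discordance, low-versus-high discordance, and
    discordance on the binary high (grade 3) versus not-high split.
    """
    n = sum(sum(row) for row in table)
    exact = sum(table[i][i] for i in range(3))
    low_intermediate = table[0][1] + table[1][0]
    intermediate_high = table[1][2] + table[2][1]
    low_high = table[0][2] + table[2][0]
    return {
        "exact": exact / n,
        "low_intermediate": low_intermediate / n,
        "intermediate_high": intermediate_high / n,
        "low_high": low_high / n,
        # A case is discordant on the binary split when exactly one platform gives grade 3.
        "binary_high": (intermediate_high + low_high) / n,
    }

# Hypothetical counts, NOT study data.
example = [[40, 15, 1],
           [12, 60, 14],
           [2, 16, 40]]
summary = grade_discordance(example)
```

Note that the binary high versus not-high discordance is the sum of the intermediate–high and low–high categories, which is consistent with the figures reported above (15.7% plus 1.5% gives approximately 17%).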
The intraobserver agreement for the two virtual sessions (V1/V2) was higher than the values of agreement between WSI and LM but remained in the moderate concordance category (table 5).
Survival analysis was performed on both the grade assigned by WSI and the original LM (table 5). WSI grading in both grading sessions showed statistically significant differences for BCSS and DMFS as did the LM grading (p=1×10⁻¹³) (table 6). Individual WSI grade components showed statistically significant differences for BCSS and DMFS. WSI tubule formation showed a stronger association with BCSS than that of LM (HR=2.8, 95% CI 1.9 to 4.0 and HR=1.9, 95% CI 1.5 to 2.4, for WSI and LM, respectively). Similar results were observed for DMFS (HR=2.6, 95% CI 1.9 to 3.6 and HR=1.7, 95% CI 1.4 to 2.1). Figures 1 and 2 show survival curves of the final WSI and LM-based histologic grade as well as grade components and BCSS.
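For readers unfamiliar with how the survival curves are constructed, the Kaplan-Meier product-limit estimate can be sketched as follows. This is a minimal illustration, not the SPSS procedure used in the study; the function name and the follow-up times are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times  -- follow-up time for each patient (eg, months)
    events -- 1 if the event (eg, death from BC) occurred at that time, 0 if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)  # events plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical follow-up data, NOT study data.
curve = kaplan_meier([5, 8, 8, 12, 15], [1, 1, 0, 1, 0])
```

In practice, one such curve is estimated per grade group (for example, WSI grade 1, 2 and 3) and the curves are compared with the log-rank test.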
To assess the prognostic independence of BC grade assigned using LM and WSI V1 and V2, multivariate analyses were performed including other established prognostic variables in the models. LM grading as well as WSI V1 and V2 were significantly associated with BCSS (p values for the three grading methods were <0.001) and DMFS (p value <0.001), independent of other variables (table 7).
Discussion
Currently, there is an increasing interest in using WSI for diagnostic and research purposes. However, it is crucial to ensure that diagnostic performance using virtual slides is at least equivalent to that of using conventional LM. To validate the diagnostic concordance of WSI and LM, The United States and Canadian Academy of Pathology recommends 12 rigorously developed guidelines with the potential of providing pathology laboratories with a practical guide to validate WSI systems for diagnostic work.22 These include, but are not limited to, the number of cases required for double reporting (at least 60 cases per application) and the washout period (at least 2 weeks). In the current study, more than 1600 BCs were regraded using WSI by an expert pathologist and the results were compared with the original grade generated in routine practice. The quantifiable three-tier system of Nottingham grade, which combines the degree of tubule formation, nuclear pleomorphism and mitotic frequency scores, is an ideal parameter for comparing WSI with LM. To assess the intraobserver concordance and the impact of WSI training, the whole cohort was graded again after a long washout interval of 3 months. The endpoints for this study were the concordance statistics as well as the patients’ clinical outcome. To the best of our knowledge, this is the largest study performing head-to-head comparison of BC grading using WSI and LM including patients’ survival as a study endpoint.
WSI grading showed moderate concordance with LM grading, comparable to the concordance rates reported among different pathologists who graded BC using conventional microscopy.16 17 Exact grade agreement between WSI and LM grading was reached in 68% of cases. This magnitude of concordance is in line with a prior reproducibility study.23 Since WSI has a procedural difference compared with LM, some emphasis was given to Cramer’s V as a measure of concordance. Multiple authorities consider a Cramer’s V greater than 0.5 to be the breakpoint for acceptable concordance.24 25 In the current study, WSI grade as compared with LM grade had a Cramer’s V of 0.58 at the ternary level and 0.66 at the binary level. These figures indicate high levels of reproducibility and demonstrate WSI reliability as a platform for grading BC, taking into account the inherent discordance levels in grade assignment between different observers using a single platform. Detailed analysis of discordance at the level of individual cases awaits further study to include evaluation by recently introduced technologies.
In this study, the true merit of WSI as compared with LM was further studied with regard to whether both offered a comparable level of patients’ survival stratification, using the large number of cases with long-term follow-up data. Both WSI and LM grade, as well as the individual grade components assessed by both platforms, showed significant association with patients’ outcome. Interestingly, tubule formation as assessed by WSI showed stronger association with outcome compared with LM assessment. Of note, our study demonstrated that morphology is easy enough to be amenable to survival analysis while technically difficult molecular assays are not.26 27
The intraobserver agreement for the two WSI sessions was moderate and showed similar association with outcome. These results support the view that the level of concordance is to a large extent related to observer performance and the subjective nature of grade rather than the platform used. This should be considered together with the limitations of the current study, which include: (1) grade was assessed by different observers; (2) the original grade was assessed using an average of four tumour tissue slides per case whereas WSI grade was assessed on a single slide; and (3) the WSI scan magnification used was ×20 rather than ×40, which is considered ideal for assessment of mitotic counts. In fact, among the three Nottingham grade components, the most challenging component to evaluate by WSI was mitotic counts. There was difficulty in discerning mitotic figures from apoptotic cells. Although this was largely attributed to resolution, the inability of WSI to provide different focal planes may have contributed as an additional hurdle. Therefore, assessment of mitotic counts using ×40 magnification may help resolve this issue. However, the large number of cases in this study and the repeated grading by the same observer using WSI have potentially overcome these limitations. The reasons for the tendency towards lower mitotic scores in WSI compared with LM are likely related to the use of a single slide per case and the lower magnification used in WSI.
In BC grading there will be, without doubt, some discordance between repeated grade assignments by WSI as well as between WSI and LM grading. Comparisons among biomarkers tested for diagnostic and research purposes share this possibility of discordance.25 26 28 However, at the level of an individual patient, especially in the diagnostic setting, discordance is usually met with caution and concordance is sought. Therefore, sustained effort is critically needed for improvement in concordance, or at least for an improved understanding of the meaning of discordance. In the current study, grading was validated as a ternary scheme and as a binary scheme to assess the concordance of both grading platforms. Previous studies addressing binary biomarkers have compared their results with grade by collapsing grade into a binary scheme. For illustrative purposes, we did the same, and showed strong concordance of WSI with LM, no matter if low and intermediate grade were combined, or intermediate with high grade; concordance of grade was Cramer’s V=0.55 if low combined with intermediate.
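Collapsing the three-tier grade into a binary scheme, as described above, amounts to merging rows and columns of the 3×3 cross-tabulation. The following sketch shows both possible groupings; the function names and counts are hypothetical, and for a 2×2 table Cramer's V is computed as the absolute value of the phi coefficient, to which it is equal.

```python
import math

def collapse_to_binary(table, first_upper_grade):
    """Collapse a 3x3 grade table (rows = LM, columns = WSI) into a 2x2 table.

    first_upper_grade=3 groups grades 1-2 versus grade 3;
    first_upper_grade=2 groups grade 1 versus grades 2-3.
    """
    out = [[0, 0], [0, 0]]
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # i and j are zero-based, so grade = index + 1.
            out[int(i + 1 >= first_upper_grade)][int(j + 1 >= first_upper_grade)] += count
    return out

def cramers_v_2x2(table):
    """Cramer's V of a 2x2 table, equal to |phi| = |ad - bc| / sqrt of the product of the margins."""
    (a, b), (c, d) = table
    return abs(a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts, NOT study data.
example = [[40, 15, 1],
           [12, 60, 14],
           [2, 16, 40]]
high_vs_rest = collapse_to_binary(example, first_upper_grade=3)
low_vs_rest = collapse_to_binary(example, first_upper_grade=2)
print(round(cramers_v_2x2(high_vs_rest), 3), round(cramers_v_2x2(low_vs_rest), 3))
```

The two groupings generally give different values because the intermediate grade, where discordance concentrates, is absorbed into a different group in each case.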
As mentioned above, two factors are thought to be responsible for any underperformance of WSI in the assessment of histologic grade: the technology itself and the reader. This study demonstrates that grading using WSI is reproducible and provides significant survival information comparable to glass slides. The concordance rate between glass slide grading and WSI was comparable to those reported using glass slides as the only tool, and the intraobserver concordance using WSI was even higher than that reported by multiple readers using glass slides.29 30 In addition to providing evidence for the reproducibility and reliability of WSI in grading BC, this study could prompt the question of what minimal number of randomly selected cases would be needed to show whether a histopathologist is able to predict survival using WSI grade. If low enough, then WSI may be a method to test competence at the level of survival prediction and not just concordance. The use of WSI technology also opens up opportunities for computer-assisted classification of histologic grade with inherent improved standardisation and reproducibility of evaluation and potential for refinement of methodology.
Take home messages
Regardless of the observer or assessment platform, histologic grade is a significant predictor of outcome.
Virtual microscopy is a reliable and reproducible method for assessing breast cancer histologic grade.
Higher magnification (×40) is recommended to produce adequate resolution for accurate grading.
Continuing advances in imaging technology could potentially provide improved performance of whole slide imaging breast cancer grading and in particular mitotic count assessment.
We thank the Nottingham Health Science Biobank and Breast Cancer Now Tissue Bank for the provision of tissue samples.
EAR and MA contributed equally.
Handling editor Dhirendra Govender.
Funding This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Ethics approval This study was approved by Nottingham Research Ethics Committee 2 under the title ‘Development of a molecular genetic classification of breast cancer’, and in compliance with current ethical and legal guidelines of the UK.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement Data are available upon request and at the discretion of the authors.