This article has been retracted. Please see:

Retracted: Immunohistochemical prognostic markers in diffuse large B-cell lymphoma: validation of tissue microarray as a prerequisite for broad clinical applications (a study from the Lunenburg Lymphoma Biomarker Consortium)
  1. D de Jong (1),
  2. W Xie (2),
  3. A Rosenwald (3),
  4. M Chhanabhai (4),
  5. P Gaulard (5),
  6. W Klapper (6),
  7. A Lee (7),
  8. B Sander (8),
  9. C Thorns (9),
  10. E Campo (10),
  11. T Molina (11),
  12. A Hagenbeek (12),
  13. S Horning (13),
  14. A Lister (14),
  15. J Raemaekers (15),
  16. G Salles (16),
  17. R D Gascoyne (4),
  18. E Weller (2)
  1. The Netherlands Cancer Institute, Amsterdam, The Netherlands
  2. Dana-Farber Cancer Institute, Boston, Massachusetts, USA
  3. Institute of Pathology, University of Würzburg, Würzburg, Germany
  4. British Columbia Cancer Agency, University of British Columbia, Vancouver, Canada
  5. Department of Pathology, Inserm U617, Hôpital Henri Mondor, Créteil, France
  6. Department of Pathology, Hematopathology Section, University Hospital Schleswig-Holstein, Campus Kiel, Germany
  7. CRUK Medical Oncology Unit, St Bartholomew’s Hospital, London, UK
  8. Karolinska Institutet, Stockholm, Sweden
  9. University Clinic Schleswig-Holstein, Campus Lübeck, Germany
  10. Hospital Clinic, University of Barcelona, Barcelona, Spain
  11. Université Paris-Descartes; AP-HP, Hôtel-Dieu, Paris, France
  12. Academic Medical Center, Amsterdam, The Netherlands
  13. Stanford University Medical Center, Palo Alto, California, USA
  14. St Bartholomew’s Hospital, Department of Medical Oncology, London, UK
  15. University Medical Center Nijmegen, The Netherlands
  16. Hospices Civils de Lyon and Université Claude Bernard Lyon-1, Lyon, France
  Correspondence to: Dr D de Jong, Department of Pathology, The Netherlands Cancer Institute, Plesmanlaan 121, 1066CX Amsterdam, The Netherlands; d.d.jong{at}nki.nl

Abstract

Background and Aims: The results of class prediction and the determination of prognostic markers in diffuse large B-cell lymphoma (DLBCL) have been variably reported. Apart from biological variations, this may be caused by differences in laboratory techniques, scoring definitions and inter- and intra-observer variation. In this study, an international collaboration of clinical lymphoma research groups has concentrated on validation and standardisation of immunohistochemistry of the currently potentially interesting prognostic markers in DLBCL.

Methods: Sections of a tissue microarray with 36 cases of DLBCL were stained in eight laboratories with antibodies to CD20, CD5, bcl-2, bcl-6, CD10, HLA-DR, MUM-1 and Ki-67 according to local methods. The study was performed in two rounds, firstly focused on the evaluation of laboratory staining variation, and secondly on the scoring variation.

Results: Different techniques resulted in highly variable results and poor reproducibility for almost all markers. Reproducibility of the nuclear markers was highly sensitive to technical variations, including immunological enhancement techniques (agreement 34%). With elimination of variation due to staining and with uniformly agreed scoring criteria, significant improvement was seen, although less so for bcl-6 and Ki-67 (agreement 53–58%). Absence of internal controls, which precludes scoring, significantly influenced the results for CD10 and bcl-6.

Conclusion: Semi-quantitative immunohistochemistry for subclassification of DLBCL is feasible, but only with optimised techniques and strict scoring criteria, and with varying rates of concordance for different markers. These findings may explain the wide variation in prognostic impact reported in the literature. Harmonisation of techniques and centralised consensus review appear mandatory when using immunohistochemical biomarkers for treatment stratification.

The use of immunohistochemical methods has become part of the routine diagnostic procedure in several tumours, and is even more essential in lymphoma. In the current classification of non-Hodgkin lymphoma, the immunophenotype constitutes an integral and very important part of the features used for classification, together with morphology, genetic characteristics and clinical findings. Apart from their use in classification, immunohistochemical prognostic and predictive biomarkers have become increasingly important for determining an expectation of survival. In the last 10 years, several markers have been identified that influence patients’ prognosis. This has led to the proposed use of these markers for risk stratification of lymphoma patients and for the development of specific therapeutic strategies.

Since the recognition of two biological subtypes of diffuse large B-cell lymphoma (DLBCL) on the basis of gene-expression profiling,1,2 the exploration of the clinical relevance of this sub-typing has been the subject of many studies. The prognostic stratification of the germinal centre B-cell-like (GCB) and activated B-cell-like (ABC) subtypes has been reproducible in most gene-expression studies by different groups.3,4 Immunohistochemistry using a limited number of markers may be an attractive alternative technique that enables exploration in larger retrospective and prospective studies, in uniformly treated series from clinical trials and in more rare and specific patient populations. Moreover, rather than using a gene-expression based approach, an immunohistochemical method might be highly suitable for implementation in daily patient management in the future. Several groups of authors have used similar approaches to translate the biological information of the GCB-like versus non-GCB-like subtypes in sizable patient series.5–12 Although the markers and algorithms were very similar, the results were remarkably variable. Some of these groups found a significant prognostic value using the immunohistochemical class prediction similar to the gene-expression method, while others did not. The inconsistency in results not only applies to the rather complex biological subtype stratification, but similarly holds true for single immunohistochemical prognostic markers such as bcl-2, bcl-6, survivin, FoxP1 and Ki-67 in DLBCL.13–23

Several biological and technical factors may underlie the inconsistent findings for the prognostic value of markers in DLBCL. These include: (1) selection of specific patient series (specific age groups, relative contribution of nodal versus extra-nodal disease); (2) treatment factors (non-uniform treatment, +/− rituximab); (3) laboratory technical variations (such as different primary antibodies, various antigen retrieval and signal amplification techniques); (4) scoring criteria and definitions; and (5) inter- and intra-observer variations.

In October 2003, the Lunenburg Lymphoma Biomarker Consortium (LLBC) was instituted as an international collaboration of nine leading European, American and Canadian clinical lymphoma research groups to unite their efforts in translational research.24 The main research focus was to determine those biomarkers thought to be important for prognosis in DLBCL. Prior to launching a comprehensive study on biopsy samples from patients treated in clinical trials, the group has concentrated on validation and standardisation of immunohistochemistry of the currently interesting prognostic markers in DLBCL. This effort has focused on the evaluation of technical and inter-observer variation in the context of strict scoring criteria and definitions.

MATERIALS AND METHODS

Tissue microarray construction

Tissue microarrays (TMAs) were prepared at the Department of Pathology of the British Columbia Cancer Agency from 36 representative cases of DLBCL and two tonsil samples with adequate archival formalin-fixed and paraffin-embedded material retrieved from six different laboratories and collected between 1984 and 2004. Representative 0.6 mm cores were taken and re-embedded in duplicate per recipient block. Six identical recipient blocks were constructed from the same donor blocks. All TMAs were checked to be equally representative for each case. Sections (5 μm) were then cut from each TMA in eight laboratories and stained with antibodies to CD20, CD5, bcl-2, bcl-6, CD10, HLA-DR, MUM-1 and Ki-67 according to locally optimised protocols (details in table 1). One aim of the study was to evaluate the variation in the results obtained with staining in different labs as done in daily practice.

Table 1 Primary antibodies and protocols

Criteria and scoring methods for immunohistochemistry

Each core was evaluated for percentage of tumour cells stained by visual estimation and the maximum of the two cores was recorded. The LLBC pathologists convened twice to determine and refine the criteria for scoring prior to the first rotation round. Table 2 lists the scoring categories. Strong emphasis was put on the presence of internal controls as a prerequisite for scoring. In the absence of an internal control, the case was considered unscorable. For CD20 and CD5, and for HLA-DR and bcl-2, reactive B-cells and T-cells respectively, which are always present in DLBCL, served as internal controls. For CD10, sporadic granulocytes and stromal fibroblasts could be used as internal controls, although these could be rather sparse. For bcl-6 and MUM-1, reactive T-cells were adequate as internal controls but often sparse; careful attention to their presence was required. Internal controls for Ki-67 could not be indicated, but since DLBCL always has a sufficient number of tumour cells in cycle, lack of internal control was not anticipated to cause a problem.
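The per-case scoring rule described above (maximum of the duplicate cores, with the case treated as unscorable when no internal control is present) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the use of None for a missing core or an unscorable case, and the boolean flag are assumptions made for the example and do not come from the study.

```python
from typing import Optional


def case_score(core_a: Optional[float],
               core_b: Optional[float],
               internal_control_present: bool) -> Optional[float]:
    """Return the case-level percentage of stained tumour cells.

    None stands for "not scored" (hypothetical encoding): either no
    internal control was seen or both cores were missing.
    For Ki-67, where no internal control could be indicated, a caller
    would pass internal_control_present=True (an assumption of this sketch).
    """
    if not internal_control_present:
        return None  # absence of an internal control makes the case unscorable
    cores = [c for c in (core_a, core_b) if c is not None]
    if not cores:
        return None  # both cores missing from the tissue microarray
    return max(cores)  # the maximum of the two cores is recorded


if __name__ == "__main__":
    print(case_score(40.0, 60.0, True))
    print(case_score(40.0, 60.0, False))
```

Mapping the returned percentage to the scoring categories of table 2 is left out here, because the category boundaries are defined in that table and not in the text.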

Table 2 Scoring criteria for immunohistochemistry in the second rotation round

The Ki-67 staining from lab 5 and the MUM-1 staining from lab 4 were not considered due to very suboptimal technical results precluding evaluation. Lab 3 did not perform an HLA-DR staining. In the first rotation round, the set of eight immunohistochemical stains from each laboratory was scored by the local pathologist and two other pathologists to assess staining and scoring variation.

Based on the results from the first rotation round which showed significant staining effects, LLBC pathologists (DdJ, AR, MC, PG, WK, BS, CT, EC, TM) convened a meeting to discuss the results and to directly compare all available stained TMA slides. When comparing all results per staining, the optimal staining per marker was selected using the results of the first round as a guideline and according to the presence of minimal artefacts and as most representative of the expected biological range/variation of the marker in DLBCL. Optimal stainings were selected by consensus from the complete collection to a set of 10 slides, one for each marker, with the exception of bcl-6 and CD5 for which two slides were selected. Subsequently, scoring variation was further evaluated in a second rotation round in which all nine pathologists scored these optimal stainings in the same slide set. For CD5 and bcl-6, two stains were selected, that differed in staining characteristics, precluding interpretation on expected best scoring reproducibility. For CD5, identical scoring criteria were applied on both stains. For bcl-6, two different sets of scoring criteria were used: one based on cell percentages and one based on staining intensity. A “scoring manual” was constructed as an additional guideline.

Statistical analysis

Measures used to evaluate agreement included overall agreement between pairs of labs as well as the proportion of patients for whom all scores agree. The pair-wise agreement was adjusted for the expected proportion of agreement assuming the scoring laboratories were independent using the generalised κ statistics.25 The level of agreement for the κ statistic was evaluated based on the following ranges: ⩽0 poor, 0.0–0.2 slight, 0.2–0.4 fair, 0.4–0.6 moderate, 0.6–0.8 substantial and 0.8–1.0 almost perfect. The standard error of the generalised κ statistic was estimated using the bootstrap method with 2000 replications.26 Resampling was performed at the patient level to conserve the correlation structure of the scores within a patient. The bootstrap confidence intervals were computed based on the percentiles of the bootstrap distribution of the statistic.26 The agreement measures were evaluated including or excluding the slides that were not scored. In the first round, the average overall and pair-wise agreement percentages across the markers were compared among the staining laboratories using the Kruskal–Wallis test; the coefficient of variation (CV) was reported.
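The agreement measures described above can be illustrated with a short Python sketch. It assumes that every case is scored by the same number of raters, that "not scored" is simply treated as one more category, and that the generalised κ is computed in the multi-rater (Fleiss-type) form; the exact estimator of reference 25 is not spelled out in the text, so this is an assumption. Function names and the handling of range boundaries in kappa_level are likewise choices made for the example.

```python
import math
import random
from collections import Counter
from itertools import combinations


def pairwise_agreement(scores):
    """Mean over cases of the fraction of rater pairs giving the same score."""
    per_case = []
    for case in scores:
        pairs = list(combinations(case, 2))
        per_case.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_case) / len(per_case)


def complete_agreement(scores):
    """Proportion of cases for which all raters give the same score."""
    return sum(len(set(case)) == 1 for case in scores) / len(scores)


def generalised_kappa(scores):
    """Chance-corrected agreement (Fleiss-type form, assumed here)."""
    n_ratings = sum(len(case) for case in scores)
    totals = Counter(label for case in scores for label in case)
    expected = sum((n / n_ratings) ** 2 for n in totals.values())
    if 1.0 - expected < 1e-12:
        return float("nan")  # only one category in use: kappa is undefined
    return (pairwise_agreement(scores) - expected) / (1.0 - expected)


def kappa_level(kappa):
    """Verbal level per the ranges in the text (upper bounds taken as inclusive)."""
    if kappa <= 0:
        return "poor"
    for upper, label in [(0.2, "slight"), (0.4, "fair"),
                         (0.6, "moderate"), (0.8, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


def bootstrap_kappa_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval; whole cases are resampled so that the
    correlation of scores within a patient is preserved."""
    rng = random.Random(seed)
    kappas = []
    for _ in range(n_boot):
        resample = [rng.choice(scores) for _ in scores]
        k = generalised_kappa(resample)
        if not math.isnan(k):
            kappas.append(k)
    kappas.sort()
    lower = kappas[int(alpha / 2 * len(kappas))]
    upper = kappas[max(int((1 - alpha / 2) * len(kappas)) - 1, 0)]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical scores: one inner list per case, one entry per rater.
    demo = [["pos", "pos", "pos"], ["neg", "neg", "pos"],
            ["neg", "neg", "neg"], ["pos", "neg", "pos"]]
    print(pairwise_agreement(demo), complete_agreement(demo))
    print(generalised_kappa(demo), bootstrap_kappa_ci(demo, n_boot=500))
```

Running the measures once with and once without the "not scored" category corresponds to the two sets of agreement figures reported in the results.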

We further evaluated whether combining biologically homogeneous categories could improve agreement. This was performed for bcl-6, CD5 and Ki-67 in the second rotation round. Using the most generally applied algorithm to distinguish GCB versus non-GCB DLBCL on the basis of immunohistochemistry for CD10, bcl-6 and MUM-1, agreement and generalised κ statistics among labs were calculated for data from both rotation rounds. A cut-off level of 25% was used for each marker as a positive score in the GCB versus non-GCB classification.
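As an illustration of the class assignment, the following Python sketch applies a three-marker decision tree with the 25% cut-off. The text does not spell out the algorithm, so the tree used here (CD10 positive gives GCB; otherwise bcl-6 negative gives non-GCB; otherwise MUM-1 decides) is an assumption based on the commonly used CD10/bcl-6/MUM-1 scheme. Treating exactly 25% as positive and returning None when any marker is unscorable are also assumptions of the sketch.

```python
from typing import Optional


def classify_gcb(cd10: Optional[float],
                 bcl6: Optional[float],
                 mum1: Optional[float],
                 cutoff: float = 25.0) -> Optional[str]:
    """Assign 'GCB' or 'non-GCB' from percentages of positive tumour cells.

    None as input means the marker was not scored; the case is then
    returned as None (not classifiable), a strict rule assumed here.
    """
    if cd10 is None or bcl6 is None or mum1 is None:
        return None

    def positive(pct: float) -> bool:
        return pct >= cutoff  # boundary handling is an assumption

    if positive(cd10):
        return "GCB"
    if not positive(bcl6):
        return "non-GCB"
    return "non-GCB" if positive(mum1) else "GCB"


if __name__ == "__main__":
    print(classify_gcb(80.0, 0.0, 0.0))
    print(classify_gcb(0.0, 60.0, 60.0))
    print(classify_gcb(None, 60.0, 60.0))
```

Because the class depends on three dichotomised markers, a disagreement or an unscorable result on any one of them can change or block the final assignment, which is why staining and scoring variation propagate directly into the GCB versus non-GCB call.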

RESULTS

Agreement across staining and scoring laboratories

Agreement results are summarised for the first rotation round in table 3 and fig 1, and for the second rotation round in table 4. The agreement results are also presented excluding the slides which could not be scored. The reasons slides could not be scored varied for different markers and are summarised for each marker below. In the first round, the average pair-wise agreement percentage between the three pathologists across markers was 71, 67, 66, 75, 67, 65, 64, and 77 for staining labs 1–8, respectively (p = 0.9, Kruskal–Wallis test). Similarly, the average complete agreement was 67, 60, 56, 56, 67, 50, 51 and 52 for staining labs 1–8, respectively (p = 0.9). This indicates that the eight staining laboratories produced a similar degree of average agreement. However, for some markers the agreement was low and the estimated standard deviation was relatively high compared to the mean (CV 21–31 for all markers except CD20 and HLA-DR, with agreement for one lab different from the other seven; fig 1), resulting in further detailed evaluation of the individual stains. The majority of the markers show a difference in agreement between both rounds, with improvement in the second round compared to the first round. However, the agreement remained low to moderate for some markers. Technical and staining aspects that could influence the scoring are discussed individually for each marker below.
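The coefficient of variation mentioned above is the standard deviation expressed as a percentage of the mean. A minimal Python sketch follows; it assumes the sample standard deviation (n − 1 denominator), which the text does not specify, and the input values are hypothetical agreement percentages, not study data.

```python
import statistics


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean, times 100."""
    mean = statistics.mean(values)
    return 100.0 * statistics.stdev(values) / mean


if __name__ == "__main__":
    # Hypothetical per-laboratory agreement percentages for one marker.
    agreement_by_lab = [35.0, 52.0, 41.0, 60.0, 38.0, 47.0, 55.0, 30.0]
    print(round(coefficient_of_variation(agreement_by_lab), 1))
```

A high CV for a marker means that agreement differs markedly between staining laboratories even when the average across markers looks similar.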

Figure 1 Pair-wise agreement percentages among three pathologists by each staining laboratory for the first rotation round for markers with higher levels of agreement (A) and markers with lower levels of agreement (B). Results are shown with the not scored slides included and excluded in the agreement calculations. The coefficients of variation (CV) across scores for a staining laboratory are computed with (CV1) and without the not scored category (CV2).
Table 3 Agreement percentages and the generalised κ statistic from the first rotation round combined across staining laboratories
Table 4 Agreement percentages and the generalised κ statistic across nine scoring pathologists in the second rotation round

CD20

Only CD20 stands out as a uniformly reliable marker to score for almost all stains and for all pathologists (fig 3A), resulting in complete agreement in the second round. However, it should also be noted that for CD20 some staining variation was present in the first round in terms of overall staining intensities, background levels and artefacts, resulting in incidental unscorable cores (up to 9% in the first round due to high background in the staining performed in lab 5) and a single discordant case, which was negative in only 1/8 CD20-stained TMA slides. The presence of nucleolar artefacts was laboratory dependent (fig 2A).

Figure 2 Distribution of the scores of the nine pathologists from the second rotation round for Ki-67 (A) and CD5 (B) from a lab using amplification protocols (x-axis is the percentage of the scores in each category).
Figure 3 (A) CD20 generally shows strong membranous staining (A1). However, in rare cases, a more variable and weak staining may be seen with a nucleolar artefact, that should not be considered in scoring and not be regarded as positive in the absence of membranous staining (A2). (B) Two different CD5 stainings on the same case in the same tissue microarray show strong enhancement with the use of Powervision (B2) compared to a standard ABC/DAB technique. (C) Internal controls for CD10 can be very sparse and consist of stromal fibroblasts (C1) and granulocytes (C2). (D) bcl-2 can be uniformly strong in diffuse large B-cell lymphoma, but also stain positive in a minority of the tumour cells and with varying intensity, or stronger (D1) or weaker (D2) than internal control T-cells.

HLA-DR

HLA-DR was also reproducible in terms of staining or scoring, with agreement of 86–92% for all staining laboratories except one in the first round, and of 95% (95% CI 86% to 99%) in the second round. One single lab (lab 7) stood out, with up to 34% unscorable cases due to high background staining in the first round. This lab routinely used an immunological amplification step for immunohistochemical visualisation (ChemMate, DAKO), resulting in a very strong signal, but at the cost of high background staining in some markers, including HLA-DR.

CD10

CD10 may be considered as a rather reliably scored marker with an agreement for pairs of labs of 87% in the first round and 95% (95% CI 88% to 99%) in the second round for interpretable cases. However, in suboptimal stains, the sparsely present internal controls (stromal fibroblasts and granulocytes, fig 3C) may be lost, resulting in a high percentage of unscorable cases (up to 39% in the first round), as also reflected by the poor agreement results with the inclusion of the not scored cases in the first round (pair-wise agreement of 65% vs 87% with and without the not scored cases, respectively). With the optimal staining in the second round, results were excellent (pair-wise agreement of 87% vs 95% with 3.7% of non-scorable cases due to lack of internal controls).

CD5

CD5 staining was strongly influenced by technical variation despite the use of the same antibody clone in all labs (4C7 from different manufacturers). The use of standard ABC/DAB visualisation produced dramatically different results (pair-wise agreement 91–98%) than visualisation with maximised enhancement systems (ChemMate, DAKO; Powervision, Immunovision Technologies; pair-wise agreement 68–69%) with far stronger membranous staining at the cost of increased intracytoplasmic background staining (fig 3B). Experience with flow-results indicates that this staining is actually specific and should be considered as positive (B Sander and RD Gascoyne, personal communication). Therefore, for the second rotation round both maximised stains were included. The second round results show, however, that extreme enhancement introduces an unacceptable level of background staining and therefore is achieved at the expense of a high percentage of unscorable cases (11% for lab 7, fig 2). CD5 was initially considered in five scoring categories. The distribution of cases over these categories showed at most 9% of the scores in each of the three intermediate categories (1–25%, 26–50%, 51–75%) (fig 2B). This suggested that the biological dichotomy could be placed at a single higher level (⩾75%), with all cases with less than 75% positive tumour cells considered as negative. This approach improved the inter-observer agreement from 71% (95% CI 56% to 88%) with five categories to 86% (95% CI 75% to 96%) with two categories, but remained at moderate level per the κ statistic (κ = 0.43 vs 0.45, including not scored cases).

bcl-2

bcl-2 showed only fair agreement in the first rotation round due to staining variation (pair-wise agreement of 47%, κ = 0.23) and a relatively high percentage of not scored cases that stood out in two of the stains (13% and 15%). While all other labs used the same antibody, these two used a different clone from another manufacturer that seemed to produce more non-specific background staining, precluding any reliable scoring. Moreover, one of the labs used a very different visualisation technique (APAAP). In the second round with the optimal stain, only 1% of samples could not be scored and moderate agreement could be reached (70% pair-wise agreement, 95% CI 56% to 87%, κ = 0.45). As an alternative approach, the intensity of cytoplasmic staining in tumour cells compared to reactive T-cells in the same sample was considered (fig 3D). This could be performed with somewhat better reproducibility (74% pair-wise agreement, 95% CI 65% to 86%, κ = 0.51). To determine whether this feature is of possible biological and prognostic relevance remains to be studied in the context of a clinical series.

MUM-1

The nuclear markers, bcl-6, MUM-1 and Ki-67, proved to be the markers most influenced by laboratory variation. For MUM-1, the problem was not the scorability, since internal controls were virtually always present. In some labs, non-specific cytoplasmic background staining and “target-like” artefacts were a major problem, precluding adequate scoring (up to 58%). When all pathologists scored the same optimal stain, only moderate agreement (pair-wise agreement of 54%, 95% CI 45% to 68%, κ = 0.41) could be reached and especially classification in the higher frequency classes was highly variable between pathologists (9–46% with scores >75%).

Ki-67

The reproducibility of scoring for Ki-67 was poor in the first rotation round (35% pair-wise agreement, κ = 0.14). Several artefacts were encountered as illustrated in fig 4. When eliminating the laboratory staining variation, the reproducibility of the scoring for Ki-67 did improve in the second round (58% pair-wise agreement, 95% CI 54% to 69%, κ = 0.39) and a high proportion of slides were scorable (94%). However, the distribution of the scores across the categories was skewed towards the higher scores (fig 2). Also in the second round, the percentage of patients with scores in the two higher categories (76–95% and >95%) was highly variable (percentage of patients >95% ranged from 15% to 52% for the nine scoring labs). Indeed, when considering these two categories together in a four-category scoring scheme, a moderate agreement (83% pair-wise agreement, 95% CI 71% to 94%, κ = 0.58) could be reached and the percentage of patients with scores >75% was less variable (66% for one lab and 73–79% for eight scoring labs).
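The effect of merging the two highest Ki-67 categories can be illustrated with a short Python sketch that recomputes pair-wise agreement after relabelling. The category labels, the merge map and the example scores are hypothetical and serve only to show the mechanism; they are not study data.

```python
from itertools import combinations


def pairwise_agreement(scores):
    """Mean over cases of the fraction of rater pairs giving the same score."""
    per_case = []
    for case in scores:
        pairs = list(combinations(case, 2))
        per_case.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_case) / len(per_case)


def merge_categories(scores, mapping):
    """Relabel each score through mapping; labels not in mapping are kept."""
    return [[mapping.get(label, label) for label in case] for case in scores]


if __name__ == "__main__":
    # Hypothetical scores from three raters on three cases.
    scores = [["76-95", ">95", ">95"],
              [">95", "76-95", "76-95"],
              ["51-75", "51-75", "51-75"]]
    merged = merge_categories(scores, {"76-95": ">75", ">95": ">75"})
    print(pairwise_agreement(scores), pairwise_agreement(merged))
```

Merging can only keep or raise the raw pair-wise agreement, so the chance-corrected κ should be checked as well before concluding that a coarser scheme is genuinely more reproducible.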

Figure 4 Artefacts in Ki-67 that preclude adequate scoring consist of very weak and heterogeneous staining with disproportional strong staining of mitoses (A1) compared to an adequate stain of the same core in the same tissue microarray (A2). Staining of apoptotic fragments (A3) and cytoplasmic background (A4) should not be considered as positive. Note the negative mitosis in A4 indicative of an inadequate staining.

bcl-6

bcl-6 was found to be the most variable and most difficult marker to score (fig 5). Despite use of the same primary antibody, the staining results varied dramatically, resulting in pair-wise agreement of 34% and κ = 0.17. Different laboratory techniques were found to strongly influence the level of sensitivity of the staining. Two labs (lab 3 and lab 7) produced positive staining with bcl-6 in virtually all cases of DLBCL when using immunological amplification techniques (ChemMate Detection Kit, DAKO and Powervision, Immunovision Technologies), while all other labs obtained some negative cases and a range of positive cases. Uniform expression of bcl-6 in DLBCL is in line with expression data using RNA-based techniques in which all DLBCL are shown to produce bcl-6 to some extent. Therefore, the staining variation reflects the sensitivity of the different techniques and the gradients of expression. The essential different staining characteristics preclude full comparison in the first round.

Figure 5 The results for bcl-6 on the same cores on the same tissue microarrays are compared in panels (A), (B) and (C), respectively, with a standard ABC/DAB technique and with a protocol that includes an immunohistochemical amplification (panels 1 and 2, respectively). Corresponding decreased intensity is seen in the less sensitive technique with increasing numbers of bcl-6 staining tumour cells falling below the detection level of the staining. In panel (C), no positive tumour cells are seen despite weak, but distinctly positive staining in the more sensitive technique (C2). Enhancement techniques may result in stronger staining at the cost of morphological artefacts (D1). Increased cytoplasmic background staining may preclude reliable scoring (D2).

In the second round, two different scoring systems were used for the two patterns; based on intensity and relative percentages of positive cells. The classical method based on percentages resulted in moderate agreement (pair-wise agreement of 53%, 95% CI 49% to 67%, κ = 0.42), but at the cost of an unacceptably high percentage of not scored cases (15%) due to not-representative staining (absence of internal controls). Intensity scoring also yielded moderate agreement (pair-wise agreement of 80%, 95% CI 69% to 91%, κ = 0.58) when dichotomised in very simplified “weak” versus “strong” categories, but with a lower percentage of not scored cases (4%).

Reproducibility of GCB versus non-GCB DLBCL class-assignment

The combined analysis of CD10, bcl-6 and MUM-1 according to set algorithms may be a surrogate for the gene-expression signatures of the prognostically relevant classes of GCB- versus non-GCB-like DLBCL. Staining and scoring variations may have a direct effect on the reproducibility of the immunological class distinction. The overall results in the first round were dominated by the large proportion of unscorable cases for CD10 (up to 39%) and the variation in staining characteristics due to signal amplification for bcl-6. Therefore, pair-wise agreement only reached 57% (95% CI 50% to 68%, κ = 0.36).

For the second rotation round with the optimal stains, pair-wise agreement increased to 77% (95% CI 64% to 90%, κ = 0.62). Exclusion of cases that could not be scored for one or more of the relevant markers according to the set criteria, however, resulted in pair-wise agreement of 89% (95% CI 77% to 98%, κ = 0.77).

DISCUSSION

Modern approaches to cancer treatment are increasingly driven by biological insights aimed at tailoring therapy. Therefore, more demand than ever is put on the pathologist to provide reproducible and reliable information on biomarkers and biological subclassification of distinct tumour types. In this study we explored variations introduced by laboratory techniques, inter-observer variations and scoring reproducibility using a set of established and potentially important immunohistochemical markers for DLBCL. Despite the fact that all stains were performed in experienced laboratories with a special focus on haematopathology, laboratory variations had a major impact on levels of agreement. When eliminating the staining variation, scoring proved to be highly reproducible between pathologists for several markers such as CD20, CD10 and HLA-DR and reasonably so for MUM-1, bcl-2 and CD5. However, for other markers including Ki-67 and bcl-6, the reproducibility was at a lower level even with exclusion of the staining variation. The results are generally in line with the immunohistochemical validation study performed by the Lymphoma/Leukemia Molecular Profiling Project.27 However, this study also provides a detailed assessment of the variation sources per marker, the possible consequences and recommendations for future studies and daily practice. Obviously, it is a matter of debate what level of reproducibility, both in terms of technical results and scoring performance, can be considered as acceptable for immunohistochemical studies.

Technical variation resulted in dramatically different and unexpected results in some situations. The use of different antibody clones for bcl-2, Ki-67 and CD10 increased the variation and mostly resulted in different proportions of unscorable cases due to non-specific background staining and artefacts. Also when using the same clones, staining intensities were quite variable, but this had generally a minor influence on the reproducibility of the scoring. However, for CD5 and bcl-6, two labs obtained results that stood out as remarkably different from all others. Only these two labs routinely included immunohistochemical enhancement techniques in their staining protocols. This resulted in essentially different staining characteristics. Up to four times more CD5-positive DLBCL were recognised with these more sensitive techniques. Moreover, the pathologists who were used to these results in daily practice adhered to a different internal standard and more often considered weak staining in non-amplified techniques as positive and thereby showed a significantly lower level of agreement to the other contributors on these markers.

The amplification effect was most impressive for bcl-6. In our series, maximal enhancement of bcl-6 staining showed that actually all DLBCL expressed bcl-6 protein to some extent. This is in line with gene-expression results which show that indeed virtually all DLBCL do express bcl-6 to some extent and the distribution of bcl-6 levels is more or less linear. This implies that the distribution of percentage classes fully depends on the sensitivity of the techniques used and that the choice of cut-off levels and the percentage of positive and negative cases that follows from these cut-offs is very much arbitrary. When the poor reproducibility of class assignment is taken into consideration (53–80% agreement and 0.37–0.58 for κ), it is obvious that comparison of studies from different centres is not straightforward and that published studies harbour unpredictable technical and interpretation differences. Indeed, the strong prognostic value of even a minor component of bcl-6 positive tumour cells in patients with DLBCL treated with CHOP chemotherapy only is quite unexpected.22 Therefore, although in general the increase of the detection levels by modern immunohistochemical enhancement techniques (standard citrate retrieval, Powervision, Envision) has been of great benefit, it certainly hampers the comparison of published literature and may show unexpected results.

Inability to assess a staining result in an individual case may be an underestimated problem. Apart from the TMA-specific problem of missing cores, reasons for excluding a case for a specific marker included different types of technical artefacts, especially high background staining and absence of an internal control. This played a major role in the first rotation, but also in the optimised staining of the second rotation. The aspect of the internal controls was mostly encountered for bcl-6 and CD10, in which the internal controls can be very sparse. For all other stains, admixed T-cells are always present and can serve as internal control. For Ki-67, mitoses that are always present in DLBCL may serve as internal controls and should be looked for. Since there is always a sizable percentage of tumour cells in cycle in DLBCL, completely negative results can be considered as not reliable.

This study shows that agreement was better for markers scored with only two categories compared to multiple categories. This would form a strong argument against too refined scoring and for omitting essentially non-reproducible cut-off points in situations that would be acceptable from a biological point of view. For daily practice, this may have implications for the distinction of DLBCL and (atypical) Burkitt lymphoma in which a very high proliferation rate is one of the defining parameters. We could not reproducibly score Ki-67, however, in categories of 75–95% versus >95% (κ = 0.39), while combining the highest categories dramatically improved reproducibility (κ = 0.58), showing that the problem indeed lies in the upper ranges and that 95% of positive cells may not be a reliable cut-off point. Taken together, Ki-67 may therefore not be a marker of choice for the classification of Burkitt lymphoma in daily practice.

All described technical issues that have an effect on staining and scoring variation may have consequences for class assignment of DLBCL into GCB and non-GCB phenotypes. Using the most generally applied algorithm, proportions vary between 65%/35%5 and 51%/49% in the published series reported thus far.9 Selected series fall significantly outside these ranges, however, suggesting also a true biological spectrum (non-GCB/GCB 68%/32% in refractory and relapsed patients,28 17%/83% in paediatric patients,29 58%/42% in high risk patients12). When “not scored” cases were excluded and the markers were dichotomised (considered positive at a cut-off level of 25% for bcl-6 and MUM-1), a very high agreement across nine labs was seen in our study (κ = 0.77, 89% agreement). Agreement dropped considerably, however, when the “not scored” category was included in the analysis (κ = 0.62, 77% agreement). Therefore, pathologists are only moderately good at consistent class assignment according to the standard algorithm, particularly on the basis of TMAs. In comparison, optimised central staining and scoring for HER2 in breast cancer patients reached agreement levels of 81.6%, compared with 75.0% for non-optimised staining, in a large validation study.30 The fact that it is only possible to reach an acceptable level of inter-observer reproducibility in this highly controlled scoring protocol is a strong argument to limit treatment stratification on the basis of GCB versus non-GCB features to clinical trials with central pathology review support, as is also recommended for HER2 testing in breast cancer patients.
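How a single missing marker result propagates into a “not scored” class assignment can be sketched as below. The text names only “the most generally applied algorithm”; the sketch assumes this refers to the sequential CD10, bcl-6, MUM-1 decision tree of Hans et al. The 25% cut-offs for bcl-6 and MUM-1 come from the text above, while the 30% CD10 cut-off, the use of "greater than or equal" at the cut-off, and all function names are assumptions made for illustration.

```python
from typing import Optional

NOT_SCORED = "not scored"


def is_positive(percent: Optional[float], cutoff: float) -> Optional[bool]:
    """Dichotomise a marker; None means the stain could not be assessed."""
    if percent is None:
        return None
    return percent >= cutoff


def assign_class(cd10: Optional[float],
                 bcl6: Optional[float],
                 mum1: Optional[float],
                 cd10_cutoff: float = 30.0,   # assumed, not given in the text
                 bcl6_cutoff: float = 25.0,
                 mum1_cutoff: float = 25.0) -> str:
    """Sequential class assignment (assumed Hans-type decision tree)."""
    cd10_pos = is_positive(cd10, cd10_cutoff)
    if cd10_pos is None:
        return NOT_SCORED
    if cd10_pos:
        return "GCB"

    bcl6_pos = is_positive(bcl6, bcl6_cutoff)
    if bcl6_pos is None:
        return NOT_SCORED
    if not bcl6_pos:
        return "non-GCB"

    mum1_pos = is_positive(mum1, mum1_cutoff)
    if mum1_pos is None:
        return NOT_SCORED
    return "non-GCB" if mum1_pos else "GCB"


# Invented example: the same case scored by two hypothetical labs,
# one of which could not assess bcl-6 (no internal control).
print(assign_class(cd10=0, bcl6=50, mum1=10))    # GCB
print(assign_class(cd10=0, bcl6=None, mum1=10))  # not scored
```

In this sketch a CD10-positive case is classified without looking at the other markers, so a missing bcl-6 or MUM-1 result only matters for CD10-negative cases. That is one way in which “not scored” results for individual markers can lower agreement on the final class.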

Only when the technical aspects and results are harmonised can biologically relevant cut-off levels be determined and can we start to define the relevant factors that determine the prognostic value of immunohistochemical markers in DLBCL, including in different biological subgroups or treatment groups (e.g. rituximab).31,32

Taken together, this study shows that semi-quantitative immunohistochemistry for prognostic stratification of DLBCL is feasible in a reproducible way, but with varying rates of success for different markers. Lack of harmonisation of techniques and interpretation is likely to explain in part the wide variability of published results. At this stage, it is recommended that clinical decisions based on immunohistochemical stratification be made in the context of clinical trials with centralised consensus review and validated assessment of biomarkers, and not on the results of individual local centres. Individual centres are strongly advised to retain a very critical approach towards their own results, to be conscious of the technical reproducibility of the markers in the local setting, to be involved in local and national quality control rotations, and to build up experience with the results in order to develop sufficient insight into the range of staining variations among cases of DLBCL.

Take-home messages

  • Immunohistochemical markers in lymphoma should only be scored in the presence of adequate staining of cell populations as internal controls.

  • Due to high levels of technical and scoring variations, comparison of published series of immunohistochemical markers in lymphoma should be done with caution.

  • At this stage, clinical decisions based on immunohistochemical stratification should only be performed in the context of clinical trials with centralised consensus review and validated assessment of biomarkers, and not on results of individual local centres.

Acknowledgments

Contributors: The Lunenburg Lymphoma Biomarker Consortium is a collaboration of nine international lymphoma collaborative groups, each represented by a clinical investigator and one or more haematopathologists and supported by a team of statisticians.

EORTC Lymphoma Group: Daphne de Jong, Dennis Veldhuizen, John Raemaekers

HOVON: Daphne de Jong, Marie José Kersten, Anton Hagenbeek

GELA: Philippe Gaulard, Thierry Molina, Josette Briere, Gilles Salles

British Columbia Cancer Center: Randy Gascoyne, Mukesh Chhanabhai, Laurie Sehn

ECOG: Randy Gascoyne, Sandra Horning

DSHNHL (German High Grade non-Hodgkin Lymphoma Group): Christoph Thorns, Andreas Rosenwald, Wolfram Klapper, German Ott, Sylvia Hoeller, Heinz-Wolfram Bernd, Michael Pfreundschuh

NLSG: Birgitta Sander, Eva Kimby

St Bartholomew’s Hospital: Abigail Lee, Andrew Norton, Andrew Clear, Andrew Lister

Independent pathology advisor: Elias Campo, Barcelona, Spain; local support: Antoni Martinez

Dana-Farber Cancer Institute, Boston, USA: Edie Weller

REFERENCES

Footnotes

  • Funding: The project was initiated and financially supported by the van Vlissingen Lymphoma Foundation. In addition, the van Vlissingen Lymphoma Foundation received unrestricted grants from Genentech, Millennium Pharmaceuticals Inc., Roche International and Schering AG.

  • Competing interests: None.
