The excellent paper by Rhodes and colleagues1 provides valuable insight into the factors that might cause interlaboratory variability in the immunohistochemical demonstration of oestrogen receptors, now probably the most frequent histopathology result that determines specific patient treatment.2 There is, however, one section of the paper where the statistical analysis may cause confusion rather than clarity.
In table 6, the degrees of agreement between the participant and organising laboratories for oestrogen receptor expression using the "Quick score" method are given. For each level of expression a κ statistic is given, and the values of all these statistics are less than zero, indicating a degree of agreement that is worse than chance alone. These κ statistics might be a reasonable reflection of the low levels of concordance, but it is unusual to calculate them for each level of expression rather than for all levels of expression together,3 and it is not clear which results have been included in each 2 × 2 contingency table to derive these statistics. A more usual way to assess the level of agreement would be to construct a single 4 × 4 contingency table containing all levels of expression and to calculate one κ statistic for the overall agreement.4 The authors may have done this in the final line of table 6, but it is not clear that this is the case. Because misclassification between distant categories (such as the classification of a high expressing tumour as a low expressing tumour) is more important than misclassification between adjacent categories, a weighted κ statistic might be appropriate.5,6 It should also be remembered that the Quick score method will be subject to observer variability,7 but the figures in the paper (for example, figs 8 and 9) demonstrate that it is an obvious difference in the intensity of staining, rather than interpretative variation, that accounts for the difference in the results.
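The single overall κ suggested here is straightforward to compute from the full cross-tabulation. The following sketch implements the standard unweighted Cohen's κ; the 4 × 4 table of Quick score categories is entirely hypothetical and is not taken from table 6 of the paper.

```python
import numpy as np

def cohens_kappa(table):
    """Unweighted Cohen's kappa for a square contingency table.

    Rows index one rater's (or assay's) categories, columns the other's.
    kappa = (p_obs - p_exp) / (1 - p_exp), where p_exp is the agreement
    expected by chance from the marginal totals.
    """
    t = np.asarray(table, dtype=float)
    n = t.sum()
    p_obs = np.trace(t) / n                    # observed agreement (diagonal)
    marg_rows = t.sum(axis=1) / n              # row marginal proportions
    marg_cols = t.sum(axis=0) / n              # column marginal proportions
    p_exp = (marg_rows * marg_cols).sum()      # chance-expected agreement
    return (p_obs - p_exp) / (1.0 - p_exp)

# Hypothetical 4 x 4 cross-tabulation of Quick score categories
# (participant assay vs organising laboratory assay); illustrative only.
example = [[12, 3, 1, 0],
           [4, 10, 5, 1],
           [1, 6, 9, 4],
           [0, 2, 5, 11]]
print(round(cohens_kappa(example), 3))  # -> 0.422
```

A perfectly concordant table (all counts on the diagonal) gives κ = 1, and a table in which agreement is no better than the marginals predict gives κ = 0; negative values, as in table 6, indicate agreement worse than chance.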
None of these minor points should detract from the very important results of Rhodes et al,1 which have important implications for any laboratory running an immunohistochemical service assessing oestrogen and progesterone receptor expression.
The authors' reply
We would like to thank Dr Cross for the interest he has shown in our article and for his kind and constructive comments.
Cohen's κ statistic1 was used in our study to compare the degree of agreement between the immunohistochemical (IHC) sensitivity for oestrogen receptors (ER) achieved by 152 laboratories on in house breast tumours and that achieved by the UK National External Quality Assessment Scheme for Immunocytochemistry (NEQAS-ICC) organising laboratories' IHC assay on spare sections from the same cases, as evaluated by the "Quick score" method.2 As quite rightly deduced by Dr Cross, the rationale behind using κ statistics for this part of the study, in addition to the Wilcoxon's matched pairs signed ranks test and the χ2 goodness of fit test, was to emphasise the lack of agreement between the pairs of matching slides. This was reflected in the negative κ scores and in the significant differences shown by the other statistical tests, both indicating a degree of agreement that was worse than would have occurred by chance alone.
In retrospect, we agree that the way the κ statistics were calculated and the way the results were expressed may have been confusing, and we welcome this opportunity to clarify the results by repeating the calculations using a 4 × 4 contingency table and by calculating a single κ statistic as suggested by Dr Cross, using the formula detailed by Robertson et al in 1981.3 Table 1 gives the results of this analysis.
This approach yields a κ coefficient of 0.19 when the Quick scores are evaluated by one of the authors (AR) and 0.20 when evaluated by a second (BJ). Although "yardsticks" are arbitrary and should not be slavishly adhered to,4 κ values less than 0.4 are generally considered to show poor agreement.5,6 As suggested by Dr Cross, a weighted κ statistic might be more appropriate, because a misclassification between distant categories is of greater importance than one between adjacent categories, and we have therefore also performed these calculations.7 Although the weighted κ statistics of 0.30 (AR) and 0.34 (BJ) are slightly higher than the unweighted κ values, they are still less than 0.4, confirming that agreement between the two assays is poor.
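The weighted κ used above penalises distant misclassifications more heavily than adjacent ones. A minimal sketch with a linear weighting scheme follows; the 4 × 4 table is hypothetical and illustrative only, not the data of table 1, and the choice of linear rather than quadratic weights is an assumption for the example.

```python
import numpy as np

def weighted_kappa(table, scheme="linear"):
    """Weighted Cohen's kappa for a square contingency table.

    Each cell (i, j) carries a disagreement weight that grows with the
    distance |i - j| between categories, so near-miss classifications
    are penalised less than distant ones. `scheme` is 'linear' or
    'quadratic'.
    """
    t = np.asarray(table, dtype=float)
    k = t.shape[0]
    i, j = np.indices((k, k))
    d = np.abs(i - j) / (k - 1)          # linear disagreement weights in [0, 1]
    if scheme == "quadratic":
        d = d ** 2
    p = t / t.sum()                      # observed cell proportions
    e = np.outer(p.sum(axis=1), p.sum(axis=0))   # chance-expected proportions
    return 1.0 - (d * p).sum() / (d * e).sum()

# Hypothetical cross-tabulation of Quick score categories; most
# disagreements sit next to the diagonal, so the weighted kappa
# exceeds the unweighted value, mirroring the pattern in the reply.
example = [[12, 3, 1, 0],
           [4, 10, 5, 1],
           [1, 6, 9, 4],
           [0, 2, 5, 11]]
print(round(weighted_kappa(example), 3))  # -> 0.584
```

When all disagreements lie in adjacent categories the weighted statistic rises towards 1 faster than the unweighted one, which is why weighting is appropriate for ordered scores such as the Quick score.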
Lastly, to emphasise that the differences observed resulted predominantly from differences in the sensitivities of the IHC assays and not observer error, we have used the same weighted κ statistic to determine the degree of intra-observer (AR) and interobserver (AR and BJ) agreement. Fleiss recommends that κ values between 0.4 and 0.75 represent fair to good agreement, and values exceeding 0.75, excellent agreement.5 The weighted κ statistics for intra-observer agreement in our study for the evaluation of the Quick scores of the participants' assays and the organising laboratories' assay on the same in house tumours, are 0.70 and 0.74, respectively, indicating good agreement by the same assessor when evaluating the same slides on two different occasions. The weighted κ statistics for inter-observer agreement when assessing the participants' and organising laboratories' IHC results by Quick score evaluation in this study are 0.79 and 0.87, respectively, indicating excellent agreement. These findings support those of previous studies that have used the Quick score method of evaluation.8,9
In summary, we conclude that the results of these additional tests suggested by Dr Cross support and clarify those published in our original study. They emphasise that the significant differences observed in the Quick score evaluations of the IHC assay results for ER on in house tumours are caused by differences in the sensitivity of the assays in different laboratories and not by observer bias.