Background Human epididymis protein 4 (HE4) measurements in serum have been proposed for improving the specificity of laboratory identification of ovarian cancer (OC).
Objective To critically review the available literature comparing the diagnostic accuracy of HE4 and carbohydrate antigen 125 (CA-125), in order to confirm the additional clinical value of HE4.
Methods A literature search was undertaken on electronic databases and references from retrieved articles; articles were analysed according to predefined criteria. Meta-analyses for HE4 and CA-125 biomarkers with OR, diagnostic sensitivity, specificity, positive (LR+) and negative (LR–) likelihood ratios as effect sizes were performed.
Results 16 articles were originally included in meta-analyses, but two for HE4 and one for CA-125 were eliminated as outliers. Furthermore, for HE4 a publication bias was detected. ORs for both HE4 (37.2, 95% CI 19.0 to 72.7, adjusted for publication bias) and CA-125 (15.4, 95% CI 10.4 to 22.8) were significant, although in a heterogeneous set of studies (p<0.0001). By combining sensitivity and specificity, the overall LR+ and LR– were 13.0 (95% CI 8.2 to 20.7) and 0.23 (95% CI 0.19 to 0.28) for HE4 and 4.2 (95% CI 3.1 to 5.6) and 0.27 (95% CI 0.23 to 0.31) for CA-125, respectively.
Conclusions HE4 measurement seems to be superior to CA-125 in terms of diagnostic performance for identification of OC in women with suspected gynaecological disease. Due to the high prevalence of OC in post-menopausal women and the need for data focused on early tumour stages, more studies tailored to these specific subsets are needed.
- Ovarian Tumour
- Tumour Markers
Ovarian cancer (OC) is the sixth most common gynaecological malignancy, with an incidence rate that increases with age and after menopause. The crude incidence rate rises from 4.7 per 100 000 in women <50 years of age to 29.6 per 100 000 in the age group of 50–64 years.1 OC is currently the leading cause of death among gynaecological malignancies; ∼75% of patients are diagnosed at an advanced stage, since OC is generally asymptomatic in the early stages and no effective screening approach is available.2 The marked discrepancy between survival rates in early and advanced stages (80–90% vs 15–20%) has reinforced the need for biomarkers with higher diagnostic accuracy to set up screening programmes and/or to distinguish malignancy from a benign pelvic mass at an early stage.2
Carbohydrate antigen 125 (CA-125) is the established biomarker for detecting OC recurrence and monitoring therapeutic response. In addition, recent guidelines recommend its measurement in the primary care setting in women with suggestive symptoms or at high risk for OC, in combination with pelvic ultrasound,3 ,4 even though some authors have discouraged this application because of the low sensitivity of the test, which is even worse in early stage tumours (∼50%).5 It is noteworthy that CA-125 is consistently expressed in serous and endometrioid OC, whereas tumours detectable at early stages have a higher prevalence of non-serous carcinomas.6 Overall, CA-125 effectiveness in the identification of the malignancy is threatened by its low diagnostic specificity. In fact, this glycoprotein is widely distributed on the surface of cells of mesothelial origin in various benign and malignant conditions other than OC.7
Among a wide spectrum of biomarkers recently proposed to aid in the diagnosis of women with suspected OC,8 human epididymis protein 4 (HE4) is undoubtedly the most promising. From the beginning, its measurement was proposed to improve on the diagnostic specificity of CA-125 while maintaining a similar sensitivity.9 HE4 has homology with some secreted serine protease inhibitors and was reported to be amplified in some CA-125-deficient OCs, whereas its expression is lower in normal ovarian tissue, ovarian benign disease and low-malignant potential tumours.10 After preliminary studies confirming genomic and immunohistochemical findings on HE4,8 a large body of literature has recently been produced. Despite the low number of initially available studies, recent guidelines resorting to a meta-analytic approach have suggested that HE4 be used as an aid in OC diagnosis.3 In addition, a systematic review (SR) has recently been published reporting better diagnostic performance in terms of sensitivity, specificity and likelihood ratios (LR) for HE4 than for CA-125.11 However, the type of included studies, the applied selection criteria and the statistical approach used to synthesise the evidence are open to criticism. Taking advantage of the recent increase in studies comparing the diagnostic performance of HE4 and CA-125 for OC, we designed an SR to critically review the available literature while overcoming the above-mentioned limitations of the SR by Yu et al.11 In particular, we sought to provide a synthesis of the available evidence on the diagnostic accuracy of the tests by considering only those studies evaluating both markers on the same case series. Methodologically, a stepwise selection of the studies followed by a proper summary receiver operating characteristic (SROC) curve analysis was used to strengthen the evidence.
Literature search strategy for identification of studies
The peer-reviewed literature published up to January 2012 was searched using the Medline (since 1966) and Embase (since 1993) databases, with MeSH terms (Human Epididymis 4 or HE4 and Ovarian), and with limits ‘Title/Abstract, Human Subjects, English’. In addition, the reference lists of retrieved articles and of a previously published meta-analysis were screened to identify further studies.11 The final aim of the search was to identify those original articles in which serum/plasma HE4 and CA-125 measurements were investigated and compared for OC diagnosis, in order to synthesise the scientific evidence by meta-analysis.
Article evaluation and data extraction
First, two reviewers (SF and FB) evaluated the title and abstract of all preliminarily identified records to assess whether each paper was relevant to the aim of the study. Then, by evaluating the complete manuscript, it was determined whether the preliminarily selected papers met the following main criteria:
The primary or secondary aim of the study was at least the report of HE4 and CA-125 mean concentrations or sensitivity and specificity versus the ‘gold standard’ method for OC diagnosis—that is, laparoscopy with histological evaluation of biopsy material.
Diagnostic parameters were estimated according to a decisional threshold level and not to a fixed specificity or sensitivity.
The presentation of quantitative data allowed at least calculation of the OR.
The investigated population was represented by women with a gynaecological disease suspected as being OC, which is the intended spectrum of patients to be investigated by circulating biomarker detection.
HE4 concentrations were not included in diagnostic algorithms used to classify patients (incorporation bias).
Papers were excluded in the following instances:
Duplicative results from the same authors’ group were being reported.
Serum/plasma HE4 concentrations were measured to assess OC recurrence, to monitor disease progression or the effect of therapy.
They were case reports.
The quality of the selected studies was judged according to the QUADAS II criteria.12
This four-phase tool was built first to identify possible sources of bias concerning patient selection, index test, reference standard, and their administration (ie, flow and timing). In addition concerns for applicability were assessed in the first three key areas.
For both risk of bias and concerns for applicability, the individual criteria were classified as ‘low’, ‘high’ or ‘unclear’ and the results were presented using tables available on the QUADAS website (http://www.quadas.org).
Detailed information on the target population was extracted according to the QUADAS II checklist concerning participants.12
Grading was applied to each study for rating the quality of evidence.13
Meta-analyses were conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.14 All quantitative data from the selected studies were converted to a common effect size (ES), the OR with its corresponding 95% CI, using Comprehensive Meta-Analysis (CMA) software V.2.2 (Biostat, Englewood, New Jersey, USA). Using CMA, a test for outliers was performed and studies with a residual p value <0.05 were eliminated. Q and I2 statistics were used to test the homogeneity among ES results. To calculate the overall combined ES, CMA provides different meta-analytic models: if heterogeneity is detected, the random-effects model is used; otherwise, the fixed-effect model is adopted. Resulting ORs were presented as forest plots with the corresponding 95% CI. The Q statistic was also used to test the significance of moderators.
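The pooling steps described above can be illustrated with a minimal sketch. It assumes the Woolf (logit) method for the OR and its standard error, and the DerSimonian-Laird estimator for the random-effects model; the text does not state which estimators CMA applied, so both choices are assumptions. The 2×2 tables and the function names `log_odds_ratio` and `pool_log_odds_ratios` are hypothetical and do not come from the included studies.

```python
import math


def log_odds_ratio(tp, fp, fn, tn, correction=0.5):
    """Log OR and its standard error (Woolf method) for one 2x2 table.

    The 0.5 continuity correction is an assumption of this sketch and is
    applied to all four cells only when at least one cell is zero.
    """
    cells = [tp, fp, fn, tn]
    if any(c == 0 for c in cells):
        cells = [c + correction for c in cells]
    tp, fp, fn, tn = cells
    log_or = math.log((tp * tn) / (fp * fn))
    se = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    return log_or, se


def pool_log_odds_ratios(log_ors, ses):
    """Pool log ORs with fixed-effect and DerSimonian-Laird random-effects models."""
    k = len(log_ors)
    w = [1 / se ** 2 for se in ses]  # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, log_ors)) / sum(w)

    # Cochran's Q and I2 (percentage of variation attributed to heterogeneity)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_ors))
    df = k - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # DerSimonian-Laird between-study variance
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    w_re = [1 / (se ** 2 + tau2) for se in ses]
    random = sum(wi * yi for wi, yi in zip(w_re, log_ors)) / sum(w_re)
    se_re = math.sqrt(1 / sum(w_re))
    ci = (math.exp(random - 1.96 * se_re), math.exp(random + 1.96 * se_re))
    return {
        "Q": q,
        "I2": i2,
        "tau2": tau2,
        "OR_fixed": math.exp(fixed),
        "OR_random": math.exp(random),
        "CI95_random": ci,
    }


# Hypothetical 2x2 tables (TP, FP, FN, TN); not data from the included studies.
tables = [(80, 10, 20, 90), (45, 12, 15, 108), (60, 5, 25, 70)]
log_ors, ses = zip(*(log_odds_ratio(*t) for t in tables))
summary = pool_log_odds_ratios(log_ors, ses)
print(summary)
```

When the studies are homogeneous the estimated between-study variance is zero and the two models give the same pooled OR; with heterogeneous studies the random-effects interval widens.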
The Egger linear regression method (available in CMA) was used to estimate potential publication bias. If Egger's method showed a statistically significant bias (p values <0.05), the ‘trim and fill’ method was used to adjust ES for bias in a funnel plot. Briefly, the ‘asymmetric’ trials on the right side of the funnel (ie, trials that have no left side counterpart) were first located. These trials were removed (‘trimmed’) from the funnel, leaving a symmetric remainder from which the true centre of the funnel was estimated by the standard meta-analysis procedure. The ‘trimmed’ trials were then replaced and their missing counterparts imputed (‘filled’): these were mirror images of ‘trimmed’ trials with the mirror axis placed at the pooled estimate. This allowed the calculation of an adjusted overall CI.
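As an illustration of the first step, the following sketch implements the usual formulation of Egger's regression: the standardised effect (log OR divided by its standard error) is regressed on precision (the reciprocal of the standard error), and an intercept far from zero suggests funnel plot asymmetry. The function name `egger_regression` and the input values are hypothetical; CMA's exact implementation may differ, and the 'trim and fill' adjustment is not reproduced here.

```python
import math


def egger_regression(effects, ses):
    """Egger's test: ordinary least squares of effect/SE on 1/SE.

    Returns the intercept (the asymmetry estimate), its standard error,
    the t statistic and the degrees of freedom (n - 2).
    """
    n = len(effects)
    if n < 3:
        raise ValueError("at least three studies are needed")
    x = [1 / se for se in ses]                    # precision
    y = [e / se for e, se in zip(effects, ses)]   # standardised effect
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all standard errors are identical")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)                            # residual variance
    se_intercept = math.sqrt(s2 * (1 / n + x_bar ** 2 / sxx))
    t = intercept / se_intercept if se_intercept > 0 else float("nan")
    return {"intercept": intercept, "se": se_intercept, "t": t, "df": n - 2}


# Hypothetical log ORs and standard errors in which the smaller (less precise)
# studies report larger effects, the pattern expected under publication bias.
ses = [0.2, 0.3, 0.5, 0.8, 1.0]
effects = [2.1, 2.6, 2.9, 3.4, 3.3]
print(egger_regression(effects, ses))
```

The p value is obtained by comparing the t statistic with a t distribution on n − 2 degrees of freedom; that step is omitted so that the sketch needs only the standard library.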
For studies with available binary data, we considered sensitivity and specificity as ES using the Meta-Analysis of Diagnostic and Screening Test (Meta-DiSc) program, V.1.4 (freeware).15 In addition, the estimate of SROC to describe the relationship between test sensitivity and specificity across all studies was considered.15
Data were presented as forest plots with the corresponding 95% CI. Positive (LR+) and negative (LR–) likelihood ratios, corresponding to sensitivity/(1−specificity) and (1−sensitivity)/specificity,16 were also estimated and meta-analysed with Meta-DiSc. In particular, the strength of the indication for the presence of the disease provided by the positive result of the test is relevant when LR+≥10, modest when 5≤LR+<10, and poor when 2≤LR+<5, and the strength of the indication for the absence of the disease provided by the negative result of the test is relevant when LR–≤0.10, modest when 0.10<LR–≤0.20, and poor when 0.20<LR–≤0.50.17 Finally, for each study, positive and negative predictive values were estimated.
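The LR definitions and the interpretation bands given above can be expressed as a short sketch. The function names are hypothetical, and the label 'unclassified' for values outside the stated bands is an assumption, since the text does not name them. The example uses the pooled HE4 sensitivity (79%) and specificity (93%) reported in the Results; ratios computed directly from pooled sensitivity and specificity need not coincide with LRs meta-analysed across studies.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity given as fractions."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


def rule_in_strength(lr_pos):
    """Strength of a positive result, using the bands quoted in the text."""
    if lr_pos >= 10:
        return "relevant"
    if lr_pos >= 5:
        return "modest"
    if lr_pos >= 2:
        return "poor"
    return "unclassified"  # below the bands defined in the text (assumption)


def rule_out_strength(lr_neg):
    """Strength of a negative result, using the bands quoted in the text."""
    if lr_neg <= 0.10:
        return "relevant"
    if lr_neg <= 0.20:
        return "modest"
    if lr_neg <= 0.50:
        return "poor"
    return "unclassified"  # above the bands defined in the text (assumption)


# Pooled HE4 sensitivity and specificity as reported in the Results.
lr_pos, lr_neg = likelihood_ratios(0.79, 0.93)
print(round(lr_pos, 2), rule_in_strength(lr_pos))
print(round(lr_neg, 2), rule_out_strength(lr_neg))
```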
Features of retrieved studies
The search strategy retrieved a total of 252 potentially eligible papers, reduced to 161 after removing duplicate records. After evaluation of titles and abstracts, a further 106 records were excluded and a total of 55 original articles were preliminarily considered eligible for full-text examination. Among those, 39 papers were excluded for the following reasons:
sensitivity and decisional threshold for HE4/CA-125 were estimated by fixing specificity (n=20);
diagnostic sensitivity and specificity were obtained only using a diagnostic algorithm in which HE4 and/or CA-125 was included (n=3);
partial or total inclusion in the control group of healthy individuals (n=10);
only median HE4/CA-125 concentrations were available (n=4); and
reported markers evaluation was on healthy subjects only (n=2).
Finally, a total of 16 articles met the criteria to be included in the meta-analysis (see online supplementary figure S1).18–33 The main characteristics of selected studies are summarised in table 1, and table 2 shows data from studies with binary data presentation, including the prevalence of OC, the adopted cut-off, and parameters related to the diagnostic performances of HE4 and CA-125 in each study.
In all studies, participants were enrolled because of the presence of gynaecological disease or, more specifically, of a pelvic mass suspected for OC.22 ,26–31 ,33 According to the QUADAS II checklist, details of the selection criteria, enrolment, sampling and data collection were retrieved.
Only a few studies adopted restrictive selection criteria by excluding pregnant women,22 ,28 subjects with presence or previous history of cancer,19 ,22 ,28 oophorectomy26 or positive to breast cancer gene expression.32 One study included in the case group patients with low-malignant tumours potentially not detectable by biomarkers and another included in the control group only women with endometriosis.23 ,30 As evidenced in table 1, the enrolment widely differed across studies for the following:
Setting of data collection (gynaecology–oncology or gynaecology).
Sample size and OC prevalence.
Patient characteristics (ie, prevalence of women of post-menopausal status).
Severity of OC (ie, prevalence of late stages).
Each of these points theoretically represented a source of heterogeneity among studies likely influencing the pre-test disease probability. Forty-four per cent of the studies were performed in a gynaecology–oncology setting, suggesting a different assessment of the disease and a higher grade of severity for OC. It is noteworthy that studies including early OC stages were performed in gynaecology. As table 1 and table 2 clearly show, the sample size and OC prevalence widely influenced the precision of the estimates and the reliability of markers’ diagnostic parameters. Wide differences in the prevalence of women in post-menopausal status across studies should influence HE4 and CA-125 diagnostic performances. Similarly discrepant performances may be reported according to the prevalence of early OC stages.
In most studies recruitment was based on the result of transvaginal ultrasonography18 ,19 ,22–27 ,31 ,32; in the remainder it was based on clinical and laboratory data. Both prospective enrolment of patients and retrospective collection of data were used to ensure the evaluation of consecutive case series.
Specimen collection was quite similar for all studies: venous blood was generally drawn before surgery into tubes containing no anticoagulants (EDTA was used in only one study)25 and, after centrifugation, samples were stored at −70/−80°C until measurements were done. Most studies used the manual HE4 enzyme immunoassay from Fujirebio Diagnostics (n=7) or the fully automated chemiluminescent microparticle-based assay on the Abbott Architect platform (n=6). For these assays, which employ the same antibodies but differ in signal detection technology (colorimetry vs chemiluminescence), similar diagnostic performance has been reported in a head-to-head comparison by Ruggeri et al.24
As reported in table 2, decision thresholds for HE4 differed across studies, even in those employing the same assay. Most studies seem to have adopted the decision thresholds suggested by the assay manufacturer, which are differentiated by menopausal status only for the Architect assay. Others derived threshold values from a parallel cohort of healthy controls or from a training group,18 ,27 ,29 ,32 or by maximising sensitivity and specificity.28 ,30
Quality and level of evidence from individual studies
The risk of bias for patient selection, index test, reference standard, flow and timing as well as the concerns for applicability related to the first three domains are shown in figure 1.
All the selected studies avoided spectrum bias by evaluating the diagnostic performances of HE4 and CA-125 in women with a gynaecological disease and suspected as having OC, thus meeting the aim of the review. Notably, eight studies selected only women with a pelvic mass,22 ,26–31 ,33 who might be the most appropriate population to be submitted to evaluation of circulating markers for diagnostic purposes. With only two exceptions,20 ,21 studies assured a consecutive enrolment, thus avoiding a selection bias. In 50% of studies a partial verification bias may occur since not all patients with benign gynaecological disease were diagnosed with the reference diagnostic method.22 ,26–31 ,33 Most studies retrospectively enrolled patients with available clinical data, and had a cross-sectional design (n=9) or were case–control studies (n=4), whereas only three articles were prospective clinical trials (PCTs).
The score related to the risk of bias and applicability for the conduct and interpretation of the index test mainly accounted for the use of assays with likely different performances (25% of studies), and for the selection of the diagnostic thresholds. Notably, in a minority of studies an overestimation of diagnostic accuracy may be suspected because of the application of data-driven cut-off values (ie, the best threshold).28 ,30
The greatest concern in the category of applicability was related to patient selection. Fifty per cent of the studies strictly selected patients with a well characterised pelvic mass and did not consider the wider framework of gynaecological diseases. In addition, only for these selected patients are there no concerns about the applicability of the reference standard. Concerns about the applicability of the index test might be overcome by working on the commutability of assays and on their diagnostic thresholds.
Using the GRADE guidelines, we were able to rate the quality of the body of evidence. Only three (19%) of the studies provided high quality and level of evidence; most evidence was classified as low for both domains.
This implies that our confidence in the estimate of effect is quite limited. There is a non-negligible risk that the diagnostic performance is overestimated.
Among the 16 studies included in the meta-analysis, two for HE4 and one for CA-125 were identified as outliers and thus eliminated.29 ,31 Figure 2 shows the overall combined ES from the random-effects model as a forest plot of the OR and corresponding 95% CI. The OR was significant for both HE4 (43.2, 95% CI 21.9 to 85.4) and CA-125 (15.4, 95% CI 10.4 to 22.8) in a heterogeneous set of studies (for HE4: Q=117.2, p<0.0001, I2=88.9%; for CA-125: Q=60.0, p<0.0001, I2=76.6%). OC prevalence, type of assay, type of enrolling medical centre, and study design were analysed as moderators, but none of these characteristics influenced the total ES. The Egger linear regression showed a significant publication bias (p=0.03) only for the HE4 outcome. Using the ‘trim and fill’ method, the adjusted value of the overall combined ES for HE4 was 37.2 (95% CI 19.0 to 72.7) with one trimmed study (see online supplementary figure S2).
Among the 14 studies displaying binary data, one for HE4 and one for CA-125 were identified as outliers and eliminated.29 ,31 Meta-analyses revealed an overall sensitivity of 79% (95% CI 76% to 81%) and a specificity of 93% (95% CI 92% to 94%) for HE4, and an overall sensitivity of 79% (95% CI 77% to 82%) and a specificity of 78% (95% CI 76% to 80%) for CA-125 (figure 3). For these studies, the global OR was 61.1 (95% CI 31.5 to 118.5, adjusted for publication bias) for HE4 and 17.4 (95% CI 11.9 to 25.4) for CA-125.
Because of the heterogeneity of studies for both HE4 and CA-125, asymmetric SROC curves were used (see online supplementary figure S3). However, visual inspection of the fitted curves revealed a wide scatter, in particular for HE4 studies, mainly due to a deceptive interpolation of a few points located in the upper left side of the graph. This is likely a consequence of the similar sensitivity and specificity of marker assays, mainly due to the use of the same antibodies and to overlapping/harmonised thresholds.
LR+ and LR– were 13.0 (95% CI 8.2 to 20.7) and 0.23 (95% CI 0.19 to 0.28) for HE4, and 4.2 (95% CI 3.1 to 5.6) and 0.27 (95% CI 0.23 to 0.31) for CA-125 (figure 4). Knowledge of LR+ is essential to evaluate the capability of the marker to recognise OC in women with suspected disease. In this regard, our meta-analysis showed a higher LR+ for HE4 than for CA-125, assigning to HE4 a relevant capability for ruling OC in. In contrast, both markers displayed relatively high LR–, indicating a poor capability to exclude the presence of OC. Given the high heterogeneity of enrolled patient populations in different studies, a meta-analysis of predictive values was not done.
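Why predictive values depend on the enrolled population can be shown with a short sketch that converts a pre-test probability into a post-test probability through the odds form of Bayes' theorem (post-test odds = pre-test odds × LR). The pooled HE4 likelihood ratios are those reported above; the prevalence values and the function name `post_test_probability` are hypothetical and chosen only for illustration.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert a pre-test probability into a post-test probability via odds."""
    if not 0 < pre_test_probability < 1:
        raise ValueError("pre-test probability must lie strictly between 0 and 1")
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


HE4_LR_POS = 13.0  # pooled LR+ for HE4 reported in the text
HE4_LR_NEG = 0.23  # pooled LR- for HE4 reported in the text

# Hypothetical OC prevalences, standing in for different enrolment settings.
for prevalence in (0.10, 0.20, 0.40):
    after_positive = post_test_probability(prevalence, HE4_LR_POS)
    after_negative = post_test_probability(prevalence, HE4_LR_NEG)
    print(f"prevalence {prevalence:.2f}: "
          f"P(OC | HE4 positive) = {after_positive:.2f}, "
          f"P(OC | HE4 negative) = {after_negative:.2f}")
```

The same likelihood ratios yield quite different post-test probabilities as prevalence changes, which is consistent with the decision not to pool predictive values across heterogeneous populations.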
An additional evaluation of diagnostic performance of the combined measurements of HE4 and CA-125 was performed in four studies.22 ,23 ,30 ,32 Their meta-analysis resulted in a pooled sensitivity of 82% (95% CI 78% to 86%) and a pooled specificity of 76% (95% CI 72% to 80%).
Four studies specifically evaluated HE4 and CA-125 on subgroups of post-menopausal women.19 ,21 ,26 ,32 However, the small study size (a total of 459 cases vs 381 controls) and the huge heterogeneity among studies did not permit any data pooling or further statistical elaboration. Because only a few articles evaluated biomarker performance in the detection of OC at an early stage (International Federation of Gynaecology and Obstetrics (FIGO) stages I and II),20 ,30 ,32 this issue also needs further study.
The main challenge for laboratory biomarkers of OC diagnosis is to allow the accurate detection of malignancy as early as possible to improve the clinical outcome and survival of patients.34 Currently, CA-125 is the most widely used marker in OC diagnostics, even though there is no agreement among different guidelines on the use of CA-125 for the screening and evaluation of high-risk women in a primary care setting.34–36 This is possibly because supporting evidence is only indirect, coming from systematic reviews of studies performed in secondary care settings that may significantly differ in case mix.3 On the other hand, the intrinsic limitations of CA-125 have greatly stimulated the search for additional biomarkers to improve the accuracy of identifying malignancy in women with a pelvic mass. Among others, HE4 has been reported as the most promising marker to aid in OC diagnosis.9 The only available meta-analysis evaluating its diagnostic value is, however, affected by important methodological limitations.11 First, the study failed to evaluate the HE4 diagnostic performance in the right clinical context (ie, women with a suspected gynaecological disease): Yu et al did not exclude studies whose control group partially or totally consisted of healthy subjects, a clinically irrelevant population, with a possible spurious increase in the clinical efficacy of the marker.11 ,37 Second, the use of symmetric SROC curves in synthesising diagnostic accuracy is open to criticism when the included studies display wide heterogeneity, as in the case of HE4 and CA-125. In these conditions, it is recommended to adopt asymmetric SROC curves.
Even in this case, on visual inspection SROC curves may appear inappropriate, as in the case of CA-125 and HE4: the model is indeed underpowered to obtain reliable estimates with adequate precision.38 The obtained asymmetric SROC curves were fitted to a small number of points covering a narrow area of the graph; thus, it seems unreasonable to expect a curve that reflects test performance at the extremes of sensitivity and specificity, where no points were observed (lower left and upper right corners of the graph). Such a situation may occur when most studies resort to the same assay and/or to similar thresholds, or when results are restricted to a narrow range of values. The wide imprecision of the obtained SROC curves, together with the fact that they may result from a deceptive mathematical interpolation, can make their interpretation misleading. Furthermore, the review by Yu et al did not deal with some relevant clinical questions concerning the introduction of HE4, such as its effectiveness in post-menopausal women and in early-stage OC. Finally, the evaluation of diagnostic performance of the combined measurements of HE4 and CA-125 was not considered, although this was the original intended application of the marker.10
Our results showed that women with gynaecological disease and increased concentrations of HE4 or CA-125 are at higher risk for malignancy. In particular, the risk for OC is significantly increased for patients with HE4 positive results (OR 37.2). As expected from immunohistochemical data,10 the sensitivity of HE4 and CA-125 overlapped (79%), while HE4 exhibited a significantly higher specificity than CA-125 (93% vs 78%). The LR calculation confirmed that HE4 outperforms CA-125 in identifying OC (LR+: 13.0 vs 4.2), whereas the capability to rule out OC was quite similar for both markers and rather poor. These results support the hypothesis that HE4 could replace CA-125 measurement as a standalone biochemical test for OC diagnosis rather than improve its diagnostic performance through combined measurement.
Although evidence of diagnostic effectiveness in detecting early-stage tumours in post-menopausal women is of pivotal relevance, there are currently not enough studies to estimate HE4 performance in this clinical scenario. In particular, the focus on menopausal status is relevant since guidelines assign the highest baseline risk index to post-menopausal women.34 The possibility that HE4 may perform differently according to menopausal status is not marginal, since higher HE4 concentrations are physiologically detectable in post-menopausal women and this may require the definition of specific clinical thresholds for this condition.34
Limitations to the clinical validity of the presented results are the significant publication bias for HE4 studies and the heterogeneity among retrieved studies. The adoption of the ‘trim and fill’ method to adjust pooled estimates for publication bias suggested, however, a relatively marginal effect of this limitation, and the recalculated OR was not significantly lower than the unadjusted one.39 The estimated heterogeneity among studies initially seemed to be mainly due to differences in sample size, leading to a slight funnel plot asymmetry. However, additional sources of heterogeneity emerged when evaluating the quality of primary studies, which was often suboptimal according to the QUADAS checklist. Differences in study design, in the clinical source of patients and in the adoption of eligibility criteria (eg, inclusion of patients with low-malignant potential ovarian tumours) were evident. In addition, studies were often not comparable because of the uneven distribution of patients by menopausal status, OC histological subtype and OC FIGO stage. Furthermore, only in some studies was the OC diagnosis performed by or in collaboration with a gynaecological oncologist. This is not a marginal issue, as there is a lively debate about the different prognostic impact of diagnostic management when performed by surgeons with appropriate expertise or by the gynaecologist alone.40 ,41 Another relevant issue was the HE4 concentration adopted as the decision threshold for OC diagnosis, since different threshold levels were often adopted subjectively across studies, even when using the same assay. Finally, there was no agreement about the need to select different HE4 thresholds for pre- and post-menopausal women.
In conclusion, there is only preliminary and weak evidence that HE4 measurement in serum outperforms CA-125 in terms of diagnostic performance for identification of OC in women with suspected gynaecological disease. The suboptimal quality of research, the modest level of evidence and the lack of agreement on decision thresholds are likely to hamper the clinical value of the marker. Before integrating HE4 into the OC diagnostic algorithm, whether to replace or to complement the CA-125 information, more robust estimates of HE4 diagnostic performance are needed. In particular, well designed PCTs are required to reinforce this preliminary evidence and to evaluate the capability of HE4 to identify OC at an early stage in post-menopausal women with a pelvic mass.
The risk for OC is significantly increased for patients with HE4 positive results (OR 37.2).
HE4 exhibited a significantly higher specificity than CA-125 (93% vs. 78%).
HE4 outperforms CA-125 in identifying OC (LR+: 13.0 vs. 4.2).
Contributors SF and FB initiated and supervised the study, performed the analysis and wrote the paper. MP contributed to the design of the study, critically revised the paper and introduced relevant scientific remarks. ML helped collect the published articles. ML, EB and PB revised the statistical approach. All of the authors have read and approved the final paper.
Competing interests None.
Patient consent Obtained.
Provenance and peer review Not commissioned; externally peer reviewed.