An integrated tool for determining the primary origin site of metastatic tumours
  1. Marcos Tadeu dos Santos1,2,
  2. Bruno Feres de Souza3,
  3. Flavio Mavignier Cárcano4,
  4. Ramon de Oliveira Vidal2,5,
  5. Cristovam Scapulatempo-Neto5,
  6. Cristiano Ribeiro Viana5,
  7. Andre Lopes Carvalho5
  1. 1 ONKOS Molecular Diagnostics, Ribeirão Preto, São Paulo, Brazil
  2. 2 Department of Research and Development (R&D), Fleury Group, Sao Paulo, Brazil
  3. 3 Federal University of Maranhão, UFMA, Sao Luis, Maranhão, Brazil
  4. 4 Department of Medical Oncology, Barretos Cancer Hospital, Barretos, Brazil
  5. 5 Molecular Oncology Research Center, Barretos Cancer Hospital, Barretos, Brazil
  1. Correspondence to Dr Marcos Tadeu dos Santos, ONKOS Molecular Diagnostics, Department of Research and Development (R&D), Ribeirão Preto, SP 14056-680, Brazil; marcos{at}


Aims Cancers of unknown primary sites account for 3%–5% of all malignant neoplasms. Current diagnostic workflows based on immunohistochemistry and imaging tests have low accuracy and are highly subjective. We aim to develop and validate a gene-expression classifier to identify potential primary sites for metastatic cancers more accurately.

Methods We built the largest Reference Database (RefDB) reported to date, composed of microarray data from 4429 known tumour samples obtained from 100 different sources and divided into 25 cancer superclasses formed by 58 cancer subclasses. Based on specific profiles generated by 95 genes, we developed a gene-expression classifier, which was first trained and tested by cross-validation. We then performed a double-blinded retrospective validation study using a real-time PCR-based assay on a set of 105 metastatic formalin-fixed, paraffin-embedded (FFPE) samples. A histopathological review performed by two independent pathologists served as the reference diagnosis.

Results By cross-validation, the gene-expression classifier correctly identified the expected cancer superclass for 86.6% of the 4429 samples from the RefDB, with a specificity of 99.43%. On the validation set of metastatic FFPE samples, the algorithm achieved a sensitivity of 83.81%, with 99.04% specificity. The overall precision of our gene-expression classifier system was 97.22%, with coefficients of variation for inter-assay, intra-assay and intra-lot observations all <4.1%.

Conclusion We developed a complete integrated workflow for the classification of metastatic tumour samples, which may help to define the tumour primary site.

  • cancer of unknown primary site
  • molecular pathology
  • metastasis

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See:



Cancers of unknown primary sites (CUPs) constitute a heterogeneous group of confirmed metastatic tumours for which the primary origin site has not been identified after extensive clinical and laboratory examination. The latter is mainly conducted through histopathological analysis of biopsy materials using immunohistochemistry (IHC) and imaging tests.1 However, this costly2 and subjective3 approach is unable to correctly identify the cancer primary site in approximately 25%–35% of metastatic tumours,4 while postmortem evaluation at autopsy finds the primary origin sites in 55%–80% of cases.1 5

Adenocarcinomas and poorly differentiated carcinomas account for 90% of CUPs, whereas neuroendocrine carcinomas account for 5% and squamous cell carcinomas for the other 5% of cases.5 Despite technological advances in the diagnostic workup, CUPs still account for 3%–5% of all malignant neoplasms and are among the 10 most frequent classes of cancer worldwide.1 6

CUPs are very rare in children and are more common in individuals averaging 60 years of age, with a slightly higher frequency in males.5 Most CUPs (80%) have an unfavourable prognosis.6 The survival rate of patients with CUP is poor, with most patients having no more than 6–12 months to live.7 This low rate results partly from the primary origin site being unknown, because knowing the site of origin is essential to optimise patient care.4 In fact, it has been shown that when the primary site is known, site-specific therapies can be used, which ultimately improves survival rates.2 Thus, there is a great need to improve the pretreatment diagnostic evaluation for patients with metastatic tumours with unknown or uncertain primary origin sites.

Although IHC is the standard technique for CUP evaluation,8 it fails to correctly identify the origin of many metastatic tumours. A meta-analysis combined with a literature review showed that IHC provided correct tissue identification in only 65.5% of the 308 cases of blended metastatic samples (95% CI 60.1% to 70.7%).9 The use of different complementary imaging exams such as mammography, radiography, ultrasounds and CT may be indicated to help correctly identify the tissues involved.10 Currently, the most commonly used imaging method for the detection of primary tumours in patients with CUP is a combination of 18F-fluorodeoxyglucose (FDG) positron emission tomography (FDG-PET) and CT.11 Although the use of FDG-PET/CT has indeed improved the management of patients with CUP, meta-analyses have reported that FDG-PET/CT is able to detect primary tumours in only 37% of patients with CUP.12 Thus, the current multidisciplinary approach still fails to correctly identify the primary site in a large number of patients. This indicates the need for new and complementary techniques for tumour classification, which would ultimately help design better treatments for patients with CUP.

Recently, molecular approaches have been used as an alternative to current pretreatment diagnostic options, especially for neoplasms that respond poorly to chemotherapy, by analysing mutations and/or expression status of specific molecular biomarkers. Genes such as EGFR, BRAF, PD-L1, ROS1, ALK, HER2, HRAS, KRAS, KIT, cMyc, BCL2 and others should be investigated in patients with CUP because they are targets for new therapeutic agents.1–5

Yet another alternative for defining the primary site in CUPs is the development of gene-expression classifiers that predict the tissue of origin and are consistent with clinical and pathological findings.13 In short, molecular classifiers are used to determine the molecular similarity of the gene expression profile of a sample of interest against a Reference Database (RefDB) of known tumour types.3 Gene-expression classifiers have been clinically validated and have been successfully used to identify the tissue of primary origin in approximately 85% of patients.4 14 15 All the molecular classifiers described in the literature have been prepared and processed in-house, with researchers using samples available on local biobanks in order to build their own RefDB.13 Although such an approach may be useful for the correct identification of CUP samples, considering that the protocol used to prepare and analyse the unknown samples is exactly the same used for the RefDB samples, it can be time-consuming and expensive, and the RefDB might be limited by sample availability.

Here we describe the results of a double-blinded and retrospective validation study of a real-time PCR-based gene-expression classifier on a set of metastatic formalin-fixed, paraffin-embedded (FFPE) samples, used to validate our RefDB. To our knowledge, this is the largest RefDB reported to date and the first RefDB totally built with tumour samples from microarray files obtained from public and online databases.

Patients and methods

Study design

This study was designed and supervised by all the investigators. All authors reviewed the study data, vouch for the fidelity of the data and the adherence of the study to the protocol, and approved the decision to submit the manuscript for publication. Samples were tested at the Fleury Group centre (in a laboratory certified according to the provisions of the College of American Pathologists (CAP), ISO-9001, ISO-14001 and the Brazilian Clinical Laboratory Accreditation Program) and at Barretos Cancer Hospital, in a laboratory certified according to the provisions of the CAP, United Kingdom National External Quality Assessment Service and the European Molecular Genetics Quality Network.

All patients provided written informed consent to participate in the study. All investigators had full access to all data and analyses.

Tumour samples used to build our RefDB

The data used to build our RefDB were obtained from processed microarray data from tumour samples, which can be downloaded from the public online repository ArrayExpress on the EMBL-EBI website. See online supplementary table S1 for all the accession codes used, divided by cancer superclass.

Supplementary file 1

In our study, only three microarray platforms were used, all from the Affymetrix GeneChip Human Genome series: A-AFFY-33 (HG-U133A [HG-U133A/B]), A-AFFY-37 (U133A 2.0 [HG-U133A_2]) and A-AFFY-44 (U133 Plus 2.0 [HG-U133_Plus_2]), except for the Thymus (Thymoma) superclass, which is composed of sample files from the Illumina ILMN_HumanWG_6v3 platform translated to A-AFFY-44 by ProbeID correspondence. Whenever possible, preference was given to microarray files from metastatic FFPE tumour samples. No cell lines, xenotransplants or samples that had been submitted to any cancer-related treatment were used. The samples were organised by tumour superclass categories composed of specific subclasses.

Metastatic FFPE samples: validation set

All metastatic FFPE tumour samples were obtained from the Barretos Cancer Hospital biobank, where they have been registered and kept. In this biobank, we identified cases of metastatic tissues of known primary origin obtained from surgical procedures. These were the FFPE samples used as the validation sample set for our gene-expression classifier.

The metastatic FFPE samples were separated according to primary cancer type and their frequencies in the biobank were calculated. A quota sample was generated following the same proportion of primary cancer types found in the biobank. Each case received a code generated by Randomizer, a publicly available website-based tool ( We randomly selected each case for each quota and retrieved the corresponding metastatic FFPE sample from each case. If the proportion for any stratum corresponded to fewer than one case, a single case was included whenever available.
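The quota sampling described above can be illustrated with a minimal sketch. It assumes simple proportional allocation with rounding; the function names, the biobank counts and the use of Python's `random` module (in place of the Randomizer website used in the study) are all hypothetical.

```python
import random

def proportional_quotas(counts_by_type, sample_size):
    """Allocate sample_size across cancer types in proportion to biobank counts.

    A type whose proportional share rounds to zero still receives one case,
    mirroring the rule stated in the text.
    """
    total = sum(counts_by_type.values())
    return {
        cancer_type: max(1, round(sample_size * count / total))
        for cancer_type, count in counts_by_type.items()
    }

def draw_cases(case_ids_by_type, quotas, seed=0):
    """Randomly pick the quota of case IDs for each cancer type."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return {
        cancer_type: rng.sample(ids, min(quotas[cancer_type], len(ids)))
        for cancer_type, ids in case_ids_by_type.items()
    }

# Invented counts, for illustration only (not the study's data).
biobank_counts = {"melanoma": 60, "kidney": 38, "thymus": 2}
quotas = proportional_quotas(biobank_counts, sample_size=20)
case_ids = {t: [f"{t}-{i}" for i in range(n)] for t, n in biobank_counts.items()}
selected = draw_cases(case_ids, quotas)
```

Because rare strata are rounded up to one case, the drawn total can slightly exceed the nominal sample size, as it does here (21 cases for a nominal 20).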

The reference diagnosis of histological types was made by full pathological workup including H&E staining and IHC, when necessary. All cases were blindly reviewed by two independent pathologists (CS-N and CRV), and only samples classified as metastatic by both, and for which both professionals agreed about their primary origin, were subjected to mRNA extraction. The metastatic FFPE tumour samples were sectioned into 10 µm sections for mRNA extraction and into 5 µm sections for evaluation by H&E staining. Because the study was blinded, the pathological consensus classification of the histological type (primary origin, ie, the reference diagnosis) of each sample was revealed only after the molecular classification process was over. None of the validation specimens was used for algorithm training.

mRNA extraction, amplification and cDNA synthesis

mRNA from the FFPE samples was isolated from two to six 10 µm tissue sections (0.5–2.0 cm² tumour area) using the RecoverAll Total Nucleic Acid Isolation Kit for FFPE (Ambion, Carlsbad, California, USA). Isolated mRNA was qualified with the Bioanalyzer Nanochip System (Agilent Technologies, Santa Clara, California, USA) and quantified with NanoDrop (Thermo Scientific, Waltham, Massachusetts, USA). Samples with an RNA Integrity Number (RIN) >1.4 and a concentration >4.4 ng/µL were eligible for whole transcriptome amplification followed by cDNA synthesis using the TransPlex Whole Transcriptome Amplification kit (WTA2; Sigma Aldrich, St. Louis, Missouri, USA). Synthesised cDNA was purified using the QIAquick PCR purification Kit (QIAGEN, Hilden, Germany) and quantified with NanoDrop (Thermo Scientific) prior to use. All steps were performed according to the manufacturers’ instructions.

Real-time PCR and analysis

Customised TaqMan low-density array (TLDA) 384-well microfluidic cards with inventoried or made-to-order predesigned assays were designed for our selected 95 genes using the Custom TaqMan Gene Expression Array Card service from Life Technologies, Carlsbad, California, USA (format 96a). For each reservoir, around 700 ng of the purified cDNA was mixed with 100 µL of 2X TaqMan Gene Expression Master Mix (Applied Biosystems, Carlsbad, California, USA) and applied to each fill-port in a 200 µL final volume mix. Using the provided swing-buckets, the TLDA cards were centrifuged in two 1 min rounds at 1200 rpm each in a Sorvall Legend RT centrifuge (Thermo Scientific) in order to fill the wells. After cutting off the reservoir flap, the TLDA cards were run on the ViiA 7 real-time PCR System (Applied Biosystems). Samples were heated at 50°C for 2 min and at 95°C for 10 min, and were then submitted to 50 cycles of 95°C for 15 s and 60°C for 1 min. After the real-time PCR run, the cycle threshold (Ct) of any gene that did not amplify (specified by the ViiA 7 software as ‘undetermined’) was arbitrarily assigned a value of 50 (the number of PCR cycles).
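The handling of non-amplified wells can be sketched as follows. This is an illustrative helper, not the study's code: the function name and the assumption that exported Ct values arrive as strings or numbers are hypothetical; only the rule that ‘undetermined’ becomes 50 comes from the text.

```python
MAX_CYCLES = 50  # number of PCR cycles in the run described above

def clean_ct(raw_values):
    """Convert exported Ct values to floats, mapping non-amplified wells to MAX_CYCLES."""
    cleaned = []
    for value in raw_values:
        text = str(value).strip().lower()
        if text == "undetermined":
            # Gene did not amplify: assign the cycle count as a ceiling value.
            cleaned.append(float(MAX_CYCLES))
        else:
            cleaned.append(float(text))
    return cleaned

example = clean_ct(["27.4", "Undetermined", 31.05])
```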

Gene-expression classifier

The gene-expression classifier compares the real-time PCR expression profile data from a metastatic FFPE sample with the microarray expression profile data from the 25 superclasses of our RefDB. For each sample, the gene-expression classifier calculates the relative expression level of each of the 92 discriminator genes by normalising to each of the three normalizer genes (discriminator and normalizer genes selected by us, as described below). For real-time PCR data, the values are expressed as an exponential delta cycle threshold (ΔCt): $2^{(\mathrm{Ct}_{\mathrm{normalizer}} - \mathrm{Ct}_{\mathrm{discriminator}})}$. For microarray data, the values are expressed as a fluorescence ratio: normalizer/discriminator. Thus, in this study, for each sample evaluated, 276 (92×3) values (attributes) were calculated and used by the gene-expression classifier algorithm to compare the metastatic FFPE samples with our RefDB. The gene-expression classifier considers the gender of the patient from whom the metastatic FFPE sample was obtained, prior to its comparison with our RefDB. For male samples, it does not take into account the ovary, uterus and breast superclasses, whereas it does not consider the prostate superclass for female samples. The classification is performed using the Random Forest algorithm16 employing the R package randomForest with all default parameters, except ntree=1000. Using this scheme, the metastatic FFPE sample is considered correctly classified if the reference diagnosis superclass is ranked among the first three superclasses, according to the class probabilities calculated by the classifier. Superclasses with probability values <5% are considered ‘ruled out’. The algorithm does not allocate samples into tumour subclasses. See online supplementary table S2 for the official gene symbols, accession codes (RefSeq in NCBI), amplicon lengths of each TaqMan assay, microarray ProbeIDs used in RefDB and literature references used in the choice of each gene.
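The attribute construction and the reporting rule can be sketched in a few lines. This is a simplified illustration under stated assumptions: the study used the R package randomForest, which is not reproduced here; the placeholder gene names, the function names and the example probabilities are hypothetical; and the sex-specific exclusion is applied to a ready-made probability table, whereas the study compares each sample against a RefDB that omits the excluded superclasses.

```python
from itertools import product

NORMALIZERS = ["ARF5", "VPS33B", "SP2"]  # normalizer genes reported in the study
EXCLUDED_BY_SEX = {"male": {"ovary", "uterus", "breast"}, "female": {"prostate"}}

def pcr_attributes(ct, discriminators, normalizers=NORMALIZERS):
    """Return {(discriminator, normalizer): 2 ** (Ct_normalizer - Ct_discriminator)}."""
    return {
        (d, n): 2.0 ** (ct[n] - ct[d])
        for d, n in product(discriminators, normalizers)
    }

def report(probabilities, sex, top_n=3, rule_out_below=0.05):
    """Rank the allowed superclasses; return (top_n ranking, ruled-out superclasses)."""
    allowed = {c: p for c, p in probabilities.items() if c not in EXCLUDED_BY_SEX[sex]}
    ranked = sorted(allowed.items(), key=lambda item: item[1], reverse=True)
    ruled_out = {c for c, p in allowed.items() if p < rule_out_below}
    return ranked[:top_n], ruled_out

# Placeholder discriminator names and invented Ct values, for illustration only.
discriminators = [f"GENE{i}" for i in range(1, 93)]
ct = {gene: 30.0 for gene in discriminators}
ct.update({"ARF5": 28.0, "VPS33B": 29.0, "SP2": 31.0})
attributes = pcr_attributes(ct, discriminators)

top3, ruled_out = report(
    {"ovary": 0.50, "uterus": 0.30, "kidney": 0.12, "lung": 0.04, "prostate": 0.04},
    sex="female",
)
```

With 92 discriminators and three normalizers, the attribute dictionary holds the 276 values mentioned in the text; a sample counts as correctly classified when its reference superclass appears anywhere in `top3`.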

Normalizer gene selection

We compared Ct values to choose normalizer genes from the initial 95 genes selected. The Ct values, collected from all genes across all the metastatic FFPE samples submitted to real-time PCR, were used to calculate the Ct SD value for each gene. The genes were ranked from lowest to highest SD, and the top 10 genes were tested as normalizer gene candidates. This test was performed only on the microarray sample files from the RefDB, using a 10-fold cross-validation approach that measures overall accuracy. We tested all possible combinations of 1–5 candidates among the top 10 selected genes.
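The size of this search can be made concrete with a short sketch. It assumes the sample SD is used for ranking (the text does not specify the estimator); the function names, toy Ct values and candidate labels are hypothetical, and the cross-validation scoring of each candidate set is not shown.

```python
from itertools import combinations
from statistics import stdev

def lowest_sd_genes(ct_by_gene, top=10):
    """Genes ordered from lowest to highest Ct standard deviation, truncated to `top`."""
    return sorted(ct_by_gene, key=lambda gene: stdev(ct_by_gene[gene]))[:top]

def normalizer_sets(candidates, max_size=5):
    """Every combination of 1..max_size candidate genes."""
    return [
        combo
        for size in range(1, max_size + 1)
        for combo in combinations(candidates, size)
    ]

# Toy Ct values, for illustration only.
toy_ct = {"G1": [25.0, 26.0, 27.0], "G2": [25.0, 25.5, 26.0], "G3": [20.0, 25.0, 30.0]}
ranked = lowest_sd_genes(toy_ct, top=2)

all_sets = normalizer_sets([f"C{i}" for i in range(1, 11)])
```

For 10 candidates, combinations of one to five genes give 10 + 45 + 120 + 210 + 252 = 637 candidate sets, each of which would be scored by 10-fold cross-validation.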

Quality control parameters applied to the metastatic FFPE tumour samples

Six genes were used to define the quality control (QC) parameters for the metastatic FFPE samples: the three selected normalizer genes plus the top three genes with the lowest SD (called QC1, QC2 and QC3 genes). The Ct values of these six genes across all the metastatic FFPE samples submitted to real-time PCR were used to observe data dispersion, to calculate IQRs and inner fence boundaries (dispersion intervals), and to identify minor outliers. Metastatic FFPE samples with Ct values for any of the six QC parameters out of the defined dispersion intervals were excluded from the final set of metastatic FFPE samples used for validation.
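A minimal sketch of this filter follows, assuming the usual Tukey definition of inner fences (Q1 − 1.5×IQR and Q3 + 1.5×IQR). The multiplier, the quartile method and the function names are assumptions; the text states only that IQRs and inner fences were used.

```python
from statistics import quantiles

def inner_fences(values, k=1.5):
    """Return (lower, upper) inner fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)

def passes_qc(sample_ct, fences_by_gene):
    """True only if every QC gene's Ct lies inside its dispersion interval."""
    return all(
        low <= sample_ct[gene] <= high
        for gene, (low, high) in fences_by_gene.items()
    )

# Toy Ct values for a single QC gene, for illustration only.
toy_ct = [20.0, 21.0, 22.0, 23.0, 24.0]
fences = {"QC1": inner_fences(toy_ct)}
kept = passes_qc({"QC1": 25.0}, fences)
dropped = passes_qc({"QC1": 30.0}, fences)
```

In the study, six intervals (three normalizer genes and three QC genes) would be checked together, and a sample failing any one of them is excluded.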

QC parameters applied to the tumour samples from our RefDB

For the microarray data of the tumour samples from our RefDB, QC parameters were defined by using the QC1, QC2 and QC3 genes in order to select samples with a better crosstalk between microarray and real-time PCR data. Using the Ct values obtained from the metastatic FFPE samples for each QC gene, three correlation values were calculated: $\mathrm{AFFPE} = 2^{\,\mathrm{QC1} - (\mathrm{QC2} + \mathrm{QC3})/2}$, $\mathrm{BFFPE} = 2^{\,\mathrm{QC2} - (\mathrm{QC1} + \mathrm{QC3})/2}$ and $\mathrm{CFFPE} = 2^{\,\mathrm{QC3} - (\mathrm{QC1} + \mathrm{QC2})/2}$. Then, the AFFPE, BFFPE and CFFPE correlation values from all the metastatic FFPE samples were used to observe data dispersion and to calculate IQRs and inner fence boundaries (dispersion intervals). These three dispersion intervals derived from the AFFPE, BFFPE and CFFPE correlations were applied to all microarray samples using the corresponding correlation formulas: $\mathrm{AARRAY} = [(\mathrm{QC2} + \mathrm{QC3})/2]/\mathrm{QC1}$, $\mathrm{BARRAY} = [(\mathrm{QC1} + \mathrm{QC3})/2]/\mathrm{QC2}$ and $\mathrm{CARRAY} = [(\mathrm{QC1} + \mathrm{QC2})/2]/\mathrm{QC3}$. All samples with AARRAY, BARRAY or CARRAY correlation values that were not in the defined dispersion intervals (minor outliers) were excluded from our RefDB final version.
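The two sets of correlation values can be written out as a short sketch. It reads the bracketed term in the FFPE formulas as an exponent of 2, by analogy with the exponential ΔCt used by the classifier; that reading, the function names and the example numbers are assumptions.

```python
def ffpe_correlations(qc1, qc2, qc3):
    """A, B, C values from real-time PCR Ct values of the three QC genes."""
    a = 2.0 ** (qc1 - (qc2 + qc3) / 2)
    b = 2.0 ** (qc2 - (qc1 + qc3) / 2)
    c = 2.0 ** (qc3 - (qc1 + qc2) / 2)
    return a, b, c

def array_correlations(qc1, qc2, qc3):
    """A, B, C values from microarray fluorescence intensities of the three QC genes."""
    a = ((qc2 + qc3) / 2) / qc1
    b = ((qc1 + qc3) / 2) / qc2
    c = ((qc1 + qc2) / 2) / qc3
    return a, b, c

# Invented inputs, for illustration only.
ffpe_a, ffpe_b, ffpe_c = ffpe_correlations(30.0, 29.0, 31.0)
array_a, array_b, array_c = array_correlations(100.0, 200.0, 400.0)
```

Both forms express one QC gene relative to the mean of the other two: a higher Ct means lower expression, so the exponent QC1 − (QC2+QC3)/2 on the PCR side plays the same role as the ratio [(QC2+QC3)/2]/QC1 on the microarray side. A microarray sample would be kept only if all three of its values fall inside the dispersion intervals derived from the FFPE values.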


To evaluate the precision of the gene-expression classifier, four metastatic FFPE samples were randomly selected: two from female (samples #52 and #56; ovary and kidney, respectively) and two from male (samples #19 and #58; liver and thyroid, respectively) patients. Each sample had nine replicates analysed by real-time PCR, using five different TLDA cards from three different lots. The intra-assay, inter-assay and inter-lot coefficient of variation (CV) values, with 95% CI, were calculated based on the Ct values from the three normalizer genes and the three QC genes for each sample and overall. The final classification ranking provided by the gene-expression classifier was used to calculate the precision of each sample and overall.

Statistical analyses

Statistical analyses were performed using the R software, an open-source statistical programming environment. Sensitivity and specificity were calculated using established methods.17 CIs for proportions are reported as two-sided exact binomial 95% CIs.
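A two-sided exact binomial interval can be sketched in plain Python as follows. This assumes the Clopper-Pearson construction, solved here by bisection; the study used R, and the function names are hypothetical.

```python
from math import exp, lgamma, log

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), valid for 0 < p < 1."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p)
        )
        total += exp(log_pmf)
    return total

def _root(increasing_fn, iterations=60):
    """Bisection on (0, 1) for a function that increases from negative to positive."""
    low, high = 0.0, 1.0
    for _ in range(iterations):
        mid = (low + high) / 2
        if increasing_fn(mid) < 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2

def exact_ci(successes, n, alpha=0.05):
    """Clopper-Pearson two-sided (1 - alpha) confidence interval for a proportion."""
    lower = 0.0 if successes == 0 else _root(
        lambda p: (1 - binom_cdf(successes - 1, n, p)) - alpha / 2
    )
    upper = 1.0 if successes == n else _root(
        lambda p: alpha / 2 - binom_cdf(successes, n, p)
    )
    return lower, upper

precision_low, precision_high = exact_ci(35, 36)
validation_low, validation_high = exact_ci(88, 105)
```

The two calls use counts reported in the Results (35/36 replicates and 88/105 validation samples), so the output can be compared with the intervals quoted there.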


RefDB composition and performance

The first version of our RefDB was composed of 7160 microarray files representing different tumour samples. After excluding the files with missing data on any of the selected 95 target genes (figure 1, step 1), 5266 files remained in version 2. By applying the QC parameters AARRAY, BARRAY and CARRAY (figure 1, step 12), the final version of our RefDB contained data from 4429 tumour samples, representing 58 tumour subclasses grouped into 25 superclasses. The number of tumour samples in each superclass varied from 22 (head and neck, salivary gland) to 626 (ovary), with a mean of 177 samples per superclass (table 1). All files were used as a training set for the gene-expression classifier, and performance was analysed by a 10-fold cross-validation that showed an overall sensitivity of 86.6% (3835/4429; 95% CI 85.55% to 87.58%) and a specificity of 99.43% (95% CI 99.18% to 99.60%) (table 2; see details in the confusion matrix in online supplementary table S3).

Table 1

The Reference Database composition

Table 2

Algorithm performance 

Figure 1

Flow chart of the necessary steps to obtain the final version of our Reference Database (RefDB) and the metastatic formalin-fixed, paraffin-embedded (FFPE) sample set for validation. Numbers in parentheses represent the step number until reaching the classification of the metastatic FFPE samples against the RefDB by the gene-expression classifier. QC, quality control.

Metastatic FFPE samples: validation set composition

Our validation sample set started with 127 FFPE tumour samples. After being reviewed by two independent pathologists (figure 1, step 3), six samples were excluded for being primary tumours and four due to disagreement between reviewers. One single metastatic male breast cancer sample was found (sample #73) and excluded because this superclass was not considered by the gene-expression classifier when applied to male samples. A total of 116 samples were submitted to RNA extraction (figure 1, step 4) and 5 yielded insufficient or no mRNA (4.3%). Of the 111 samples subjected to real-time PCR (figure 1, step 6), 6 samples (5.4%) were excluded for not meeting QC standards (figure 1, step 10.1). Therefore, the final set of metastatic FFPE samples included 105 samples, 51 (48.57%) from female and 54 (51.43%) from male patients. The main sites of the retrieved metastases were the lymph nodes, with 51 samples (48.57%), followed by the lungs with 17 samples (16.19%) and the liver with 16 (15.24%). None of the validation specimens was used for algorithm training.
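The sample attrition above can be checked with a few lines of arithmetic. The counts are taken from the paragraph; the variable names and step labels are illustrative.

```python
exclusions = [
    ("primary tumours on pathological review", 6),
    ("disagreement between reviewers", 4),
    ("male breast cancer (superclass not used for male samples)", 1),
    ("insufficient or no mRNA", 5),
    ("failed QC after real-time PCR", 6),
]

remaining = 127  # FFPE tumour samples at the start
history = [remaining]
for _reason, excluded in exclusions:
    remaining -= excluded
    history.append(remaining)

# Percentages quoted in the text, recomputed from the counts.
no_rna_pct = round(100 * 5 / 116, 1)
qc_fail_pct = round(100 * 6 / 111, 1)
```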

Normalizer, discriminator and QC genes

The top 10 genes showing the lowest Ct SD in the 111 metastatic FFPE samples analysed by real-time PCR were used as candidates for normalizer genes. These were ARF5, LY6E, PANX1, KDELR2, SLC35F5, VPS33B, SDC1, HSDL2, MAP2K6 and SP2 (SD ranging from ±1.73 to ±3.51, figure 1, step 7). The tests performed on our RefDB showed no improvement in the final results when more than three genes were used as normalizers. For example, when ARF5, VPS33B, SP2 and LY6E were tested, the result was 83.1%. The best combination of normalizer genes was ARF5, VPS33B and SP2, which resulted in 86.6% overall accuracy (figure 1, step 9). All the other 92 genes were used as discriminators. The top three genes with the lowest Ct SD among the discriminators were also used as QC genes. These were PANX1 (QC1), LY6E (QC2) and KDELR2 (QC3). The Ct dispersion ranges of the three normalizer genes and of the three QC genes were applied as QC parameters (figure 2A) to identify six outlier samples prior to forming the final set of metastatic FFPE samples for validation. The dispersion ranges of the correlation values corresponding to AARRAY, BARRAY and CARRAY from the three QC genes were also applied as QC parameters (figure 2B) to identify 837 outliers before preparing the final version of our RefDB.

Figure 2

Quality control parameters. (A) Box plot of the cycle threshold dispersion range of the three normalizer genes and the three quality control (QC) genes used to determine the QC parameters for the metastatic formalin-fixed, paraffin-embedded samples from the validation set. (B) Box plot of dispersion ranges applied as QC parameters for the Reference Database based on the three QC genes used to calculate the AARRAY, BARRAY and CARRAY correlation values. The whiskers represent the lower and upper fences. (N), normalizer gene; Q1 and Q3, first and third quartiles.

Gene-expression classifier agreement with the reference diagnosis

The only available information about the patients from whom the 105 metastatic FFPE samples were obtained was their gender. A total of 54 male metastatic FFPE samples were classified by the algorithm and compared against 3387 samples from the final RefDB (not considering the data from the 1042 samples in the breast, ovary and uterus superclasses). Likewise, the 51 female metastatic FFPE samples were compared against 4310 samples from the final RefDB (not considering the data from the 119 prostate superclass samples). The gene-expression classifier had high sensitivity, yielding the same results as the analyses of the two independent pathologists for 83.81% of the samples (88/105; 95% CI 75.35% to 90.28%), with 99.04% specificity (95% CI 94.73% to 99.87%) (table 2; see details in the confusion matrices in online supplementary tables S4 and S5). Almost 70% (61/88; 95% CI 58.58% to 78.71%) of the correctly classified metastatic FFPE samples were ranked in the first position by the algorithm. In only 8.6% (9/105; 95% CI 3.99% to 15.65%) of the samples did the expected superclass receive a probability <5%, so that it was ruled out by the algorithm as a possible site of tumour origin. Grouped by gender, results showed a sensitivity of 86.3% (44/51; 95% CI 73.74% to 94.3%) for female samples and 81.5% (44/54; 95% CI 68.57% to 90.75%) for male samples.
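The top-three scoring rule and the figures above can be reproduced with a small sketch. The counts come from the text; the function name and the example rankings are hypothetical.

```python
def is_correct(ranked_superclasses, reference, top_n=3):
    """A sample is a hit if its reference superclass is among the top_n ranked ones."""
    return reference in ranked_superclasses[:top_n]

REFDB_TOTAL = 4429
male_pool = REFDB_TOTAL - 1042    # breast, ovary and uterus samples left out
female_pool = REFDB_TOTAL - 119   # prostate samples left out

def pct(hits, total):
    """Percentage rounded to two decimals."""
    return round(100 * hits / total, 2)

overall_sensitivity = pct(88, 105)
female_sensitivity = pct(44, 51)
male_sensitivity = pct(44, 54)
first_rank_share = pct(61, 88)

# Invented rankings, for illustration only.
hit = is_correct(["ovary", "uterus", "kidney", "lung"], "uterus")
miss = is_correct(["ovary", "uterus", "lung", "kidney"], "kidney")
```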


The overall CV results calculated from the Ct values of the three normalizers and the three QC genes in the four randomly selected metastatic FFPE samples were 4.06% (95% CI 3.63% to 4.48%) for inter-assay, 3.73% (95% CI 3.30% to 4.16%) for intra-assay and 3.56% (95% CI 3.08% to 4.05%) for intra-lot observations. For the CV of each sample, see online supplementary table S6. Considering the final classification obtained by the gene-expression classifier for each replicate, results showed an overall precision of 97.22% (35/36; 95% CI 85.47% to 99.93%). The only incorrect classification was a single replicate from sample #56, for which the expected classification (kidney) was ranked in the fourth position. Samples #19 (liver), #52 (ovary) and #58 (thyroid) showed 100% precision, and 8, 9 and 9 of the 9 replicates from each sample, respectively, were ranked in the first position (table 3).
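For reference, a coefficient of variation and the replicate-level precision can be computed as below. This sketch assumes the CV is the sample SD divided by the mean, expressed as a percentage; the grouping of replicates into assays and lots is not reproduced, and the Ct values are invented.

```python
from statistics import mean, stdev

def cv_percent(ct_values):
    """Coefficient of variation (%) of a set of replicate Ct values."""
    return 100.0 * stdev(ct_values) / mean(ct_values)

# Invented replicate Ct values for one gene, for illustration only.
replicate_ct = [28.0, 29.0, 30.0]
cv = cv_percent(replicate_ct)

# Precision as reported: 35 of 36 replicates had the expected superclass in the top three.
precision_pct = round(100 * 35 / 36, 2)
```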

Table 3

Reproducibility of the gene-expression classifier 


One of our main goals with this study was to assess whether the information provided by a gene-expression classifier could be useful in clinical practice. The 25 cancer types (superclasses) that compose the final version of our RefDB were established after (1) excluding tumour types that do not generate metastasis and (2) merging or splitting cancer types in order to create superclasses that make sense from the clinical perspective. For example, we excluded metastatic meningioma tumours, which represent only 0.1% of meningioma cases,18 and male breast cancer, because it represents only approximately 1% of all breast cancers.19 Although we were not able to determine the exact proportion of cases of metastatic male breast cancers, we believe that metastatic cases of male breast cancer that cannot be classified by IHC are very rare. For this reason, we did not include them in our RefDB. Likewise, metastatic tumours from the brain are rarely observed. A large study that classified 10 062 metastases from 4012 autopsies could not identify a single case of metastasis from the brain.20 For this reason, these cancer types were excluded from our final RefDB, as they would most likely add not information to our algorithm but rather an additional level of confusion. Some cancer types that commonly have different outcomes depending on their subtype were split into different superclasses, such as germ cell tumours (split into non-seminomatous and seminomatous superclasses) and lung cancers (split into adenocarcinoma/large cell carcinoma and small cell carcinoma superclasses, with the carcinoid subtype used to compose the neuroendocrine superclass).

Another aspect that may be useful for diagnostic purposes is the RefDB final report, which indicates the three most likely potential cancer origins of a metastatic sample. The report lists the top three ranked superclasses and their respective probabilities. In most CUP cases, patients are submitted to many other tests that may rule out some sites of origin. If the report considered only the top-ranked superclass as a possible tumour origin, it would be too restrictive and not very informative for cases in which the suggested tumour origin had already been excluded by other tests. Furthermore, in some cases, the clinical outcome of different tumour types could be similar. The information in the report should be combined with that previously collected with other tests, as together these may provide the key to better clinical management. A good example is the three metastatic uterus FFPE samples in the validation sample set, which were all correctly classified as ‘uterus’ in the second position of the final classification ranking. In the first position, the algorithm classified the three samples as ‘ovary’, a closely related cancer type. On the other hand, all four metastatic ovary FFPE samples tested were correctly classified as ‘ovary’ in the first position and as ‘uterus’ in the second position. In these cases, the superclasses ranked in the first and second positions constitute very informative data when analysed together, pointing to gynaecological cancers. Considering that metastasis is seen in advanced stages of cancer, the possible treatments recommended for these cases would probably follow similar protocols.

The use of reports containing the top three ranking cancer types provides additional knowledge and warrants further investigations. The metastatic thyroid FFPE samples #24 and #45 are good examples of this. Both samples were correctly classified as ‘thyroid’ in the first position, but they were also classified as ‘melanoma’ in the second and third positions, respectively. Even though doubts may still remain as to the correct cancer origin, these results point to the investigation of the BRAF V600E mutation status as this mutation has already been described both in thyroid21 and melanoma cancers.22 Depending on the results, the use of kinase inhibitor targeted therapies, such as vemurafenib, might be indicated.23 24

Although the gene-expression classifier considers only cancer superclasses for sample classification, 58 subclasses were merged to form these superclasses, combining tumour subtypes with similar clinical outcomes or histological architecture within the same superclass. To our knowledge, this is not only the largest RefDB described to date, including data from 4429 tumour samples, but also the first RefDB that is 100% composed of microarray sample files obtained from public platforms available online.

The choice of using this model of RefDB, instead of obtaining, preparing and analysing our own set of samples and generating our own microarray data, offers some advantages: first, it offers the possibility of having a sample collection that is highly heterogeneous, containing samples collected in many countries around the world by >100 research groups. This fact alone improves the chances that the collection is not specific to only certain populations. Second, it increases the number of samples of cancer types that are usually very rare in tumour biobanks (eg, thymus and salivary gland tumours). Finally, this approach saved us an enormous amount of time and resources of all kinds by avoiding production of data that is already publicly available.

However, we did find some problems when working with the publicly available files, as not all 95 selected genes were represented in all samples. Additionally, even after exclusion of sample files with missing data on any of the 95 genes of interest, our RefDB did not achieve acceptable crosstalk with the real-time PCR data through the gene-expression classifier, indicating that a large number of poor-quality samples were present in our RefDB. The choice of QC genes and the generation of QC parameters from the data obtained from metastatic FFPE samples (AFFPE, BFFPE and CFFPE), and their application to the microarray data (AARRAY, BARRAY and CARRAY), were key to improving the gene-expression classifier performance.

Considering that none of the 105 metastatic FFPE validation samples was used for algorithm training, the 83.8% overall sensitivity observed demonstrates that we obtained a high-quality and effective crosstalk between microarray and real-time PCR data, since the overall sensitivity of the RefDB itself, by 10-fold cross-validation, was 86.6%. Another novel aspect of our study is the algorithm’s capacity to take into account the gender of the patient from whom the classified sample was taken. This approach improved our probability rates.
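How the algorithm uses patient gender is not detailed here. One plausible mechanism, shown purely as an assumption, is to set the probability of gender-incompatible superclasses to zero and renormalise the remainder. The class names and the `INCOMPATIBLE` mapping below are hypothetical and are not taken from the study.

```python
from typing import Dict, Set

# Hypothetical mapping of superclasses assumed incompatible with a given gender.
INCOMPATIBLE: Dict[str, Set[str]] = {
    "female": {"prostate"},
    "male": {"ovary"},
}


def apply_gender_mask(probs: Dict[str, float], gender: str) -> Dict[str, float]:
    """Zero out incompatible superclasses and renormalise the rest to sum to 1."""
    excluded = INCOMPATIBLE.get(gender, set())
    masked = {c: (0.0 if c in excluded else p) for c, p in probs.items()}
    total = sum(masked.values())
    if total == 0.0:
        # Nothing left to renormalise: fall back to the unmasked probabilities.
        return dict(probs)
    return {c: p / total for c, p in masked.items()}
```

Under this assumption, the probability mass removed from an excluded superclass is redistributed proportionally, which would raise the probabilities of the remaining candidates.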

Some technical challenges had to be overcome during this study. For instance, FFPE material is known for its poor preservation of mRNA integrity. Nevertheless, FFPE is by far the most used material in clinical routine as it allows for a range of different exams to be performed in traditional workups and is easily manipulated, transported and stored. The average RIN (the gold-standard parameter of RNA quality evaluation) obtained from the extracted metastatic FFPE samples was 2.27 on a scale of 1 (most degraded) to 10 (intact), which means the mRNA was highly degraded. In addition to the low quality, low quantities were retrieved during the extraction step, which led us to use whole transcriptome amplification prior to cDNA synthesis and then to use real-time PCR assays that target the shortest possible amplicons. The average size of the amplicons for all 95 genes was 81 base pairs. This approach allowed us to include samples with a RIN as low as 1.4, the lowest among all extracted samples (sample #47), which was correctly classified and ranked in the first position.

Although metastatic tumours with the same primary organ of origin may exhibit different phenotypes depending on the molecular interactions with the microenvironment of the secondary organ they colonise (which may be reflected in each tumour’s global gene expression profile),25 our results show that the profiles defined by our panel of 95 genes are in some way retained in most of the 105 cases that we tested. For example, the 12 metastatic FFPE samples with a reference diagnosis of ‘melanoma’ were taken from different organs and sites, including soft tissue, peritoneum, ovaries, lymph nodes and skin. Nevertheless, the algorithm correctly classified all 12 samples, and 9 of these were ranked in the first position. The same was observed with 13 samples derived from liver, lung and lymph node metastases from intestinal cancers, which were all correctly classified, with 11 correct classifications ranked in the first position.
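The notion of a correct classification "ranked in the first position" can be made concrete with a small sketch. It assumes the classifier returns one probability per superclass and that a case counts as correct when the reference diagnosis appears within the top `k` ranks. The value of `k` and both function names are assumptions for illustration only.

```python
from typing import Dict, List, Optional, Tuple


def rank_of_reference(probs: Dict[str, float], reference: str) -> Optional[int]:
    """Return the 1-based rank of the reference class, or None if it is absent."""
    ordered = sorted(probs, key=lambda c: probs[c], reverse=True)
    return ordered.index(reference) + 1 if reference in ordered else None


def top_k_rate(cases: List[Tuple[Dict[str, float], str]], k: int) -> float:
    """Fraction of cases whose reference class appears within the top k ranks."""
    hits = 0
    for probs, reference in cases:
        rank = rank_of_reference(probs, reference)
        if rank is not None and rank <= k:
            hits += 1
    return hits / len(cases)
```

With `k = 1` this gives the share of cases ranked in the first position; a larger `k` gives the share of cases counted as correctly classified under the assumed criterion.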

Our study has some limitations. For instance, we cannot verify the reference diagnosis of the microarray files used to build our RefDB. Although the data we used come from sound experiments, misclassification errors are inherent to the IHC workflow, so it cannot be said with 100% certainty that all 4429 microarray files come from samples that were correctly ascribed to the different superclasses. Additionally, we had fewer metastatic FFPE samples in our validation set than is desirable. However, the overall performance of the gene-expression classifier on the 105 validation samples (83.8%) is similar to the overall performance obtained with the 10-fold cross-validation of the 4429 microarray samples from the RefDB (86.6%), which suggests that the profiles of the 25 superclasses generated by the algorithm are very specific and indicates high agreement between the microarray and real-time PCR data.


In conclusion, we developed a new way of classifying metastatic tumour samples by a complete and integrated workflow. The workflow starts with the manual handling of metastatic FFPE samples, followed by gene expression analysis in a process controlled by QC parameters, and ends with a Random Forest-based proprietary algorithm that extracts specific patterns from our curated RefDB with very high reproducibility. We expect that our strategy, which offers solutions for an unmet clinical need, could become an important tool to be used in combination with other current techniques, as the recent literature suggests.26 Ultimately, we hope that the approach presented here could help physicians to provide better outcomes for any selected patient with cancer who could be positively impacted by a less subjective workup.

Take home messages

  • To the best of our knowledge, this is the largest Reference Database reported to date for molecular classification of cancer, with 4429 samples divided into 25 cancer types.

  • Sample selection based on the quality control parameters described is crucial for aligning the microarray and qPCR data and thus achieving correct classification of metastatic samples.

  • Our gene-expression classifier can assist and complement non-molecular techniques to identify the primary site of metastatic tumours with high accuracy.


The authors thank Pedro Bordeaux-Rego for his contributions, from his input on the initial study plan to his great help in obtaining funding sources for this project; Dr Luis Carlos Lorenzi for mediating the contact with Barretos Cancer Hospital; Dr Rui Reis, Dr Marcia Silveira, Dr Adriane Evangelista, Augusto Marino and Renato Oliveira for their expert technical assistance; Dr Edgar Rizzatti and Dr Jeane Tsutsui for managing the project inside Fleury Group; and Rafael Paes for amendments to the figures. Finally, and most importantly, the authors would like to thank the 4556 patients who made this study possible.



  • Handling editor Runjan Chetty.

  • Contributors MTdS and ALC designed and supervised the study. MTdS performed all the experiments and drafted the manuscript. FMC selected all metastatic FFPE samples. BFdS and RdOV contributed with bioinformatics. CS-N and CRV contributed as pathologists. FMC and ALC contributed as clinical oncologists. All authors read and approved the final version.

  • Funding This study was supported by FINEP (02.12.0223.00).

  • Competing interests MTdS holds equity at ONKOS Molecular Diagnostics but does not serve as a consultant or hold equity or equity options at Fleury Group. All other authors have no competing interests to declare.

  • Ethics approval The study was approved by the institution-specific investigational ethics committees of Fleury Group and Barretos Cancer Hospital, registered as CAAE 10670112.0.0000.5437, and conducted according to the Declaration of Helsinki.

  • Provenance and peer review Not commissioned; internally peer reviewed.
