Extraction and classification of structured data from unstructured hepatobiliary pathology reports using large language models: a feasibility study compared with rules-based natural language processing
  1. Ruben Geevarghese (1)
  2. Carlie Sigel (2)
  3. John Cadley (3)
  4. Subrata Chatterjee (3)
  5. Pulkit Jain (3)
  6. Alex Hollingsworth (3)
  7. Avijit Chatterjee (3)
  8. Nathaniel Swinburne (4)
  9. Khawaja Hasan Bilal (2)
  10. Brett Marinelli (1)

Affiliations
  1. Division of Interventional Radiology, Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, New York, USA
  2. Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York, USA
  3. Artificial Intelligence & Machine Learning, Digital, Informatics and Technology Solutions (DigITs), Memorial Sloan Kettering Cancer Center, New York, New York, USA
  4. Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, New York, USA

Correspondence to Dr Brett Marinelli; marinelb@mskcc.org

Abstract

Aims Structured reporting in pathology is not universally adopted, and extracting elements essential to research often requires expensive and time-intensive manual curation. This study examines the accuracy and feasibility of using large language models (LLMs) to extract essential pathology elements for cancer research.

Methods This was a retrospective study of patients who underwent pathology sampling for suspected hepatocellular carcinoma and subsequently underwent yttrium-90 embolisation. Five pathology report elements of interest were included for evaluation. LLMs (Generative Pre-trained Transformer (GPT) 3.5 turbo and GPT-4) were used to extract the elements of interest. For comparison, a rules-based approach using regular expressions (REGEX) was devised for extraction. Accuracy for each approach was calculated.

Results A total of 88 pathology reports were identified. LLMs and REGEX were both able to extract research elements with high accuracy (average accuracy 84.1%–94.8%).

Conclusions LLMs have significant potential to simplify the extraction of research elements from pathology reports and thereby accelerate the pace of cancer research.
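
As an illustration of the rules-based comparison and the accuracy calculation described in the Methods, the following is a minimal Python sketch. It is not the study's implementation: the abstract does not list the five elements or the rules used, so the element names (`diagnosis`, `differentiation`), the patterns, and the function names `extract_elements` and `accuracy` are all hypothetical. Accuracy is taken here to be the fraction of extracted values that exactly match a manually curated reference, which is an assumption about how it was scored.

```python
import re

# Hypothetical element names and patterns, for illustration only.
# The study's actual five elements and REGEX rules are not given in the abstract.
PATTERNS = {
    "diagnosis": re.compile(r"hepatocellular carcinoma", re.IGNORECASE),
    "differentiation": re.compile(
        r"\b(?:well|moderately|poorly)[- ]differentiated\b", re.IGNORECASE
    ),
}


def extract_elements(report_text):
    """Return the first match for each pattern, lower-cased, or None if absent."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(report_text)
        result[name] = match.group(0).lower() if match else None
    return result


def accuracy(predictions, references):
    """Fraction of predictions that exactly equal the curated reference values."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one reference value is required")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


if __name__ == "__main__":
    # Invented example text, not taken from any real report.
    sample = "Liver, biopsy: Hepatocellular carcinoma, moderately differentiated."
    print(extract_elements(sample))
```

The same `accuracy` function could score an LLM's output against the curated reference, so that both approaches are compared on identical terms.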

  • Artificial Intelligence
  • LIVER
  • Liver Neoplasms


Footnotes

  • Handling editor Vikram Deshpande.

  • Contributors RG contributed to conceptualisation, methodology, formal analysis, investigation, data curation, writing–original draft, writing–reviewing and editing, and visualisation; CS contributed to writing–original draft and writing–reviewing and editing; JC contributed to conceptualisation, data curation, methodology, investigation and writing–reviewing and editing; SC contributed to conceptualisation, methodology, resources and writing–reviewing and editing; PJ, AH and KHB contributed to writing–reviewing and editing; AC provided resources and contributed to writing–reviewing and editing; BM contributed to conceptualisation, methodology, data curation, writing–original draft, writing–reviewing and editing, visualisation, supervision and project administration. All authors read and approved the final paper. BM is the guarantor.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; internally peer reviewed.