pith. machine review for the scientific record.

arxiv: 2604.06208 · v1 · submitted 2026-03-16 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Extracting Breast Cancer Phenotypes from Clinical Notes: Comparing LLMs with Classical Ontology Methods

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 10:59 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords breast cancer · phenotype extraction · clinical notes · large language models · ontology methods · information extraction · oncology · EMR

The pith

An LLM-based framework extracts breast cancer phenotypes from clinical notes with accuracy comparable to classical ontology methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an LLM-based information extraction framework to pull phenotypes such as chemotherapy outcomes, biomarkers, tumor locations, sizes, and growth patterns from unstructured oncology provider notes. It applies the framework to breast cancer cases and compares its performance directly against earlier knowledge-driven systems that pair annotation with the NCIt Ontology Annotator. The central result is that the LLM approach reaches similar accuracy levels and can be fine-tuned for other cancer types once trained. Readers would care because the majority of detailed clinical data resides in natural-language notes rather than structured EMR fields, so an adaptable extraction method could unlock broader analysis without repeated ontology construction for each disease.
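The pipeline's output stage maps each note to a structured record. As a minimal sketch of that post-processing step — parsing an LLM's JSON output and normalizing it against an expected phenotype schema — the following may help; the field names are illustrative, not taken from the paper:

```python
import json

# Illustrative phenotype fields; the paper's exact schema is not specified here.
EXPECTED_FIELDS = {"biomarkers", "tumor_location", "tumor_size", "treatment_outcome"}

def parse_extraction(raw: str) -> dict:
    """Parse an LLM's raw JSON output into a uniform phenotype record.

    Unparseable output yields an all-None record; missing fields are filled
    with None so downstream comparison against ontology-based annotations
    always sees the same record shape.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    return {field: data.get(field) for field in EXPECTED_FIELDS}

record = parse_extraction('{"biomarkers": ["ER+", "HER2-"], "tumor_size": "2.1 cm"}')
```

Normalizing to a fixed schema before scoring is what makes a head-to-head comparison with an ontology annotator's structured output well-defined.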

Core claim

The paper claims that an LLM-based information extraction framework can be easily adapted to extract phenotypes with an accuracy that is comparable to the classical ontology-based methods, demonstrated on breast cancer phenotypes from provider notes, while noting that trained models can be fine-tuned to handle other cancer types and diseases.

What carries the argument

The LLM-based information extraction framework that processes unstructured provider notes to identify and extract medical phenotypes, directly benchmarked against the NCIt Ontology Annotator.

If this is right

  • Phenotype data trapped in free-text notes becomes extractable at scale without building separate ontologies for each disease.
  • Fine-tuning a trained LLM allows quick extension to additional cancer types with limited new engineering.
  • Larger clinical studies can incorporate more treatment outcome and biomarker details from natural language records.
  • EMR systems gain a practical alternative to rule-based annotation for ongoing phenotype capture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • LLMs could reduce dependence on manually curated medical ontologies for information extraction tasks.
  • Integration into live EMR workflows might enable real-time phenotype tracking to support treatment decisions.
  • Performance on varied note styles across institutions would need separate validation to confirm broad applicability.

Load-bearing premise

The evaluation dataset of clinical notes fairly represents real-world variability so that accuracy comparisons between the LLM framework and ontology methods are reliable.

What would settle it

A test on a new collection of diverse clinical notes from multiple hospitals where the LLM accuracy falls substantially below the ontology-based method would falsify the comparability result.

Figures

Figures reproduced from arXiv: 2604.06208 by Abdullah Bin Faiz, Arbaz Khan Shehzad, Asad Afzal, Maryam Noor Awan, Momin Tariq, Muddassar Farooq, Muhammad Siddiqi, Muhammad Usamah Shahid.

Figure 1. Complete pipeline to extract JSON information from unstructured text.
Figure 2. Distribution of token lengths (in tokens) of the breast cancer notes.
read the original abstract

A significant amount of data held in Oncology Electronic Medical Records (EMRs) is contained in unstructured provider notes -- including but not limited to the chemotherapy (or cancer treatment) outcome, different biomarkers, the tumor's location, sizes, and growth patterns of a patient. The clinical studies show that the majority of oncologists are comfortable providing these valuable insights in their notes in a natural language rather than the relevant structured fields of an EMR. The major contribution of this research is to report an LLM-based framework to process provider notes and extract valuable medical knowledge and phenotype mentioned above, with a focus on the domain of oncology. In this paper, we focus on extracting phenotypes related to breast cancer using our LLM framework, and then compare its performance with earlier works that used knowledge-driven annotation system, paired with the NCIt Ontology Annotator. The results of the study show that an LLM-based information extraction framework can be easily adapted to extract phenotypes with an accuracy that is comparable to the classical ontology-based methods. However, once trained, they could be easily fine-tuned to cater for other cancer types and diseases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an LLM-based information extraction framework for extracting breast cancer phenotypes (e.g., chemotherapy outcomes, biomarkers, tumor location and size) from unstructured oncology EMR provider notes. It compares this approach to a classical knowledge-driven method using the NCIt Ontology Annotator and claims that the LLM framework achieves comparable accuracy while being easily adaptable to other cancer types after initial training.

Significance. If the empirical comparison is supported by rigorous quantitative evaluation on identical data, the result would be significant for clinical NLP: it would demonstrate that LLMs can serve as a flexible, low-engineering alternative to ontology-based phenotype extraction, with potential for rapid domain adaptation across diseases.

major comments (2)
  1. [Abstract] The central claim that the LLM framework extracts phenotypes 'with an accuracy that is comparable' to the NCIt Ontology Annotator supplies no precision, recall, F1, or other metrics, no sample size (number of notes or phenotypes), no description of gold-standard creation, and no statement that both systems were evaluated on the same held-out notes using identical phenotype definitions. This renders the comparability assertion unverifiable.
  2. [Methods/Results] The manuscript does not specify the LLM model, prompting strategy, fine-tuning details, or evaluation protocol (e.g., inter-annotator agreement, exact phenotype list, train/test split), all of which are load-bearing for reproducing the adaptation claim and for confirming that the comparison is fair.
minor comments (2)
  1. [Abstract] The phrasing 'extract valuable medical knowledge and phenotype mentioned above' is grammatically awkward, and the referent of 'mentioned above' is unclear.
  2. [Results] The manuscript would benefit from an explicit table listing the exact phenotypes extracted and the corresponding performance numbers for both methods.
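The metrics the referee asks for could be computed per phenotype against matched gold annotations. A minimal sketch, assuming simple exact-match scoring over extracted mentions (the paper does not specify its matching criterion):

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple:
    """Exact-match precision, recall, and F1 for one phenotype's mentions."""
    tp = len(predicted & gold)                      # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: LLM output vs. a gold standard for one note's biomarkers.
llm_out = {"ER+", "HER2-", "PR+"}
gold = {"ER+", "HER2-"}
p, r, f = precision_recall_f1(llm_out, gold)  # p = 2/3, r = 1.0, f = 0.8
```

Running both the LLM framework and the NCIt Ontology Annotator through the same scorer on the same held-out notes is what would make the comparability claim verifiable.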

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important gaps in reporting that we have addressed through revisions to improve verifiability and reproducibility.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the LLM framework extracts phenotypes 'with an accuracy that is comparable' to the NCIt Ontology Annotator supplies no precision, recall, F1, or other metrics, no sample size (number of notes or phenotypes), no description of gold-standard creation, and no statement that both systems were evaluated on the same held-out notes using identical phenotype definitions. This renders the comparability assertion unverifiable.

    Authors: We agree that the original abstract did not supply sufficient quantitative detail to support the comparability claim. The revised abstract now reports precision, recall, and F1 scores for both approaches, the total sample size (notes and phenotypes), the gold-standard creation process (including annotator qualifications and agreement), and an explicit statement that both the LLM framework and NCIt annotator were evaluated on the identical held-out test set using the same phenotype definitions. revision: yes

  2. Referee: [Methods/Results] The manuscript does not specify the LLM model, prompting strategy, fine-tuning details, or evaluation protocol (e.g., inter-annotator agreement, exact phenotype list, train/test split), all of which are load-bearing for reproducing the adaptation claim and for confirming that the comparison is fair.

    Authors: We agree that these implementation and evaluation details were missing from the original submission and are necessary for assessing reproducibility and fairness of the comparison. The revised manuscript now specifies the LLM model, prompting strategy, whether fine-tuning was used, the inter-annotator agreement for the gold standard, the complete list of phenotypes, and the train/test split. These additions allow readers to reproduce the experiments and confirm that the two systems were compared under identical conditions. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical comparison study

full rationale

The paper is an empirical comparison of an LLM-based information extraction framework against classical ontology-based methods (NCIt annotator) for breast cancer phenotypes from clinical notes. It contains no equations, parameter fittings, derivations, or self-citations that reduce the central comparability claim to inputs by construction. The reported accuracy comparability is presented as an experimental outcome rather than a self-definitional or fitted-input result. No load-bearing premises rely on prior author work for uniqueness or ansatz smuggling. The study is self-contained as a direct experimental evaluation without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that clinical notes contain extractable phenotype information in natural language, and that LLM outputs can be meaningfully compared to ontology annotations without any additional, separately specified validation layer.

axioms (1)
  • domain assumption Clinical notes contain extractable phenotype information in natural language.
    Invoked as the basis for both LLM and ontology extraction tasks.

pith-pipeline@v0.9.0 · 5527 in / 1061 out tokens · 40003 ms · 2026-05-15T10:59:10.542008+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages

  1. [1] Beltagy, I., Lo, K., Cohan, A.: SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676 (2019)

  2. [2] Dunn, A., Dagdelen, J., Walker, N., Lee, S., Rosen, A.S., Ceder, G., Persson, K., Jain, A.: Structured information extraction from complex scientific text with fine-tuned large language models. arXiv preprint arXiv:2212.05238 (2022)

  3. [3] D'Souza, J., Ng, V.: Sieve-based entity linking for the biomedical domain. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). pp. 297–302 (2015)

  4. [4] Golbeck, J., Fragoso, G., Hartel, F., Hendler, J., Oberthaler, J., Parsia, B.: The National Cancer Institute's thesaurus and ontology. Journal of Web Semantics 1(1), 75–80 (2003)

  5. [5] Huang, J., Yang, D.M., Rong, R., Nezafati, K., Treager, C., Chi, Z., Wang, S., Cheng, X., Guo, Y., Klesse, L.J., et al.: A critical assessment of using ChatGPT for extracting structured data from clinical notes. npj Digital Medicine 7(1), 106 (2024)

  6. [6] Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2020)

  7. [7] Lin, F.P., Groza, T., Kocbek, S., Antezana, E., Epstein, R.J.: Cancer care treatment outcome ontology: a novel computable ontology for profiling treatment outcomes in patients with solid tumors. JCO Clinical Cancer Informatics 2, 1–14 (2018)

  8. [8] Simmons, A., Takkavatakarn, K., McDougal, M., Dilcher, B., Pincavitch, J., Meadows, L., Kauffman, J., Klang, E., Wig, R., Smith, G., et al.: Benchmarking large language models for extraction of International Classification of Diseases codes from clinical documentation. medRxiv pp. 2024–04 (2024)

  9. [9] Wang, H., Zheng, J.G., Ma, X., Fox, P., Ji, H.: Language and domain independent entity linking with quantified collective validation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pp. 695–704 (2015)

  10. [10] Wornow, M., Lozano, A., Dash, D., Jindal, J., Mahaffey, K.W., Shah, N.H.: Zero-shot clinical trial patient matching with LLMs. arXiv preprint arXiv:2402.05125 (2024)

  11. [11] Zheng, J.G., Howsmon, D., Zhang, B., Hahn, J., McGuinness, D., Hendler, J., Ji, H.: Entity linking for biomedical literature. BMC Medical Informatics and Decision Making 15, 1–9 (2015)